Author Topic: ARM 32F4xx versus Cray 1  (Read 2925 times)


Offline peter-h

  • Super Contributor
  • ***
  • Posts: 1664
  • Country: gb
  • Doing electronics since the 1960s...
ARM 32F4xx versus Cray 1
« on: December 10, 2020, 04:14:19 pm »
Both run at 150MHz and both have hardware floating point :)

The Cray 1 had four cores and probably lots of other goodies which optimised it for things like crypto, but what were they? Did it have custom microcode capability?

Of course one clock cycle is not like another. The old micros (Z80, etc) needed 3-4 clocks minimum per instruction. I don't know how many the ARM needs but (I am doing some software on it now, in C) it runs very roughly 10x faster than a 16MHz Z180 used to, so I doubt it is doing 1 instruction per clock. I think modern PC CPUs do most instructions in one clock cycle. OTOH the ARM is "RISC" while an 80x86 has vastly more opcode options. The Cray, I don't know.

I recall Apple having made a load of PR out of buying a Cray 1 for their software development department. Was this complete nonsense in reality? I imagine there was a C compiler for it...
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 90S1200 32F417
 

Offline Yansi

  • Super Contributor
  • ***
  • Posts: 3828
  • Country: 00
  • STM32, STM8, AVR, 8051
Re: ARM 32F4xx versus Cray 1
« Reply #1 on: December 10, 2020, 04:27:55 pm »
F4 also can have HW crypto.

And even though both run at 150MHz (which is only half the truth for both), it is like comparing apples and oranges.
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 1664
  • Country: gb
  • Doing electronics since the 1960s...
Re: ARM 32F4xx versus Cray 1
« Reply #2 on: December 10, 2020, 06:01:19 pm »
Yes the 407 has hardware RSA and AES256.

I was looking for more than a 1-line answer though :)
 
The following users thanked this post: RichardS

Offline fcb

  • Super Contributor
  • ***
  • Posts: 2103
  • Country: gb
  • Test instrument designer/G1YWC
    • Electron Plus
Re: ARM 32F4xx versus Cray 1
« Reply #3 on: December 10, 2020, 06:14:18 pm »
You can sit on the leatherette power supplies of the Cray-1.
https://electron.plus Power Analysers, VI Signature Testers, Voltage References, Picoammeters, Curve Tracers.
 
The following users thanked this post: newbrain, harerod

Offline Yansi

  • Super Contributor
  • ***
  • Posts: 3828
  • Country: 00
  • STM32, STM8, AVR, 8051
Re: ARM 32F4xx versus Cray 1
« Reply #4 on: December 10, 2020, 06:18:10 pm »
Quote from: peter-h
Yes the 407 has hardware RSA and AES256.

I was looking for more than a 1-line answer though :)

407 does not, but 417 does, if examples need to be given.

Also, what answer would you like to hear? I just told you: you're trying to compare apples and oranges.

If you want to compare something, set the constraints: what to compare and under which conditions. Otherwise we end up quoting both datasheets and comparing dick lengths on everything.
« Last Edit: December 10, 2020, 06:20:48 pm by Yansi »
 

Offline CJay

  • Super Contributor
  • ***
  • Posts: 4027
  • Country: gb
Re: ARM 32F4xx versus Cray 1
« Reply #5 on: December 10, 2020, 08:19:55 pm »
Quote from: fcb
You can sit on the leatherette power supplies of the Cray-1.

I prefer the Cray-2, sure you lose out on the pleather seating but you get a relaxing water feature.
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 1664
  • Country: gb
  • Doing electronics since the 1960s...
Re: ARM 32F4xx versus Cray 1
« Reply #6 on: December 11, 2020, 10:03:24 am »
Is anyone here familiar with the Cray 1 processor and whether it had any application specific features, or was it just fast for its time (150MHz in 1975 was fast - 10x faster than mainframes).
 

Offline Kjelt

  • Super Contributor
  • ***
  • Posts: 6228
  • Country: nl
Re: ARM 32F4xx versus Cray 1
« Reply #7 on: December 11, 2020, 04:25:20 pm »
Quote from: peter-h
Both run at 150MHz and both have hardware floating point :)
The first Cray 1 ran at 80MHz and did 150MFlops, about the same as the F4 I guesstimate.
But I concur with the apples and oranges statement above. Besides being a supercomputer, the Cray is an excellent heater and paperweight  :)

https://en.wikipedia.org/wiki/Cray-1

Quote
I don't know how many the ARM needs but (I am doing some software on it now, in C) it runs very roughly 10x faster than a 16MHz Z180 used to, so I doubt it is doing 1 instruction per clock.
Most instructions take 1 cycle, some more IIRC, but the problem is the memory fetch: most embedded microcontrollers execute from slow flash, and depending on manufacturer and clock frequency each fetch can cost a couple of cycles' penalty (wait states).
If you transfer your program to internal SRAM and execute from that, you will see much better performance.
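
As a rough illustration of the wait-state point, here is a minimal sketch in Python. The function name `effective_mips` and all the numbers (168MHz clock, 5 wait states, the fraction of fetches that stall) are assumptions picked for illustration, not measurements of any particular part; real chips also have prefetch buffers or caches that hide part of this penalty.

```python
def effective_mips(clock_mhz, base_cpi, wait_states, stall_fraction):
    """Estimate instruction throughput when some fetches pay flash wait states.

    clock_mhz      -- core clock in MHz
    base_cpi       -- cycles per instruction with zero-wait-state memory
    wait_states    -- extra cycles added to a fetch that goes to flash
    stall_fraction -- share of instructions whose fetch actually stalls (0..1)
    """
    cpi = base_cpi + wait_states * stall_fraction
    return clock_mhz / cpi


# Hypothetical 168MHz core, 1 cycle per instruction when nothing stalls.
from_sram = effective_mips(168, 1.0, 0, 0.0)          # zero-wait SRAM
from_flash_worst = effective_mips(168, 1.0, 5, 1.0)   # every fetch stalls
from_flash_cached = effective_mips(168, 1.0, 5, 0.1)  # 90% of stalls hidden

print(from_sram, from_flash_worst, from_flash_cached)
```

In this toy model the worst case drops to one sixth of the SRAM figure, which is why moving a hot inner loop to RAM can pay off even when the rest of the program stays in flash.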
 
The following users thanked this post: I wanted a rude username

Online bdunham7

  • Super Contributor
  • ***
  • Posts: 4945
  • Country: us
Re: ARM 32F4xx versus Cray 1
« Reply #8 on: December 11, 2020, 04:38:29 pm »
Quote from: peter-h
Is anyone here familiar with the Cray 1 processor and whether it had any application specific features, or was it just fast for its time (150MHz in 1975 was fast - 10x faster than mainframes).

I believe it was the first implementation of a vector processor, sort of an early SSE/AVX.
A 3.5 digit 4.5 digit 5 digit 5.5 digit 6.5 digit 7.5 digit DMM is good enough for most people.
 

Offline rstofer

  • Super Contributor
  • ***
  • Posts: 9105
  • Country: us
Re: ARM 32F4xx versus Cray 1
« Reply #9 on: December 11, 2020, 11:37:50 pm »
Wiki has everything:

https://en.wikipedia.org/wiki/Cray-1

There were many specialized units intended to support things like vector processing.  One instruction to multiply 1 million cells of a pair of vectors.  That saves 999,999 fetches!  Plus whatever it might require for indexing.

Seymour Cray always provided multiple FPUs and those worked pretty fast, for the period.

The Cray 1 might be good for 150 MFlops while a PC might be good for 50 GFlops.  Using a GPU, it might be 1500 GFlops of 32 bit or 1400 GFlops of 64 bit floating point.  About 9000 times faster than a Cray 1 for 64 bit FP.
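
For anyone who wants to check the ratios, a quick sketch using only the round figures quoted in this post (they are ballpark numbers, not benchmark results; the variable names are just for illustration):

```python
# Ballpark throughput figures as quoted in the post, in FLOPS.
cray1_flops = 150e6        # 150 MFlops
pc_flops = 50e9            # 50 GFlops
gpu_fp64_flops = 1400e9    # 1400 GFlops of 64 bit

pc_ratio = pc_flops / cray1_flops
gpu_ratio = gpu_fp64_flops / cray1_flops

print(f"PC vs Cray 1:  about {pc_ratio:.0f}x")
print(f"GPU vs Cray 1: about {gpu_ratio:.0f}x")
```

The GPU ratio comes out a little over 9300, which is where the "about 9000 times" figure comes from.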

Those old mainframes were interesting - at the time...
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 1664
  • Country: gb
  • Doing electronics since the 1960s...
Re: ARM 32F4xx versus Cray 1
« Reply #10 on: December 12, 2020, 09:58:30 am »
I would be amazed if the 32F4xx did 1 FLOP per clock cycle. I know one can have a 32 bit barrel shifter and that does a fast multiply but how would you do division in 1 clock cycle?

Thanks for the interesting input re copying code to RAM. Unfortunately there isn't a whole lot of RAM to play with, so one could do this only with little bits of code. Which is all that's needed.
 

Offline Yansi

  • Super Contributor
  • ***
  • Posts: 3828
  • Country: 00
  • STM32, STM8, AVR, 8051
Re: ARM 32F4xx versus Cray 1
« Reply #11 on: December 12, 2020, 10:59:52 am »
https://developer.arm.com/documentation/100166/0001/Floating-Point-Unit/FPU-functional-description/FPU-instruction-set-table?lang=en

See for yourself how many cycles the FPU instructions take. Most of the basic ones are single cycle  ;)
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 1664
  • Country: gb
  • Doing electronics since the 1960s...
Re: ARM 32F4xx versus Cray 1
« Reply #12 on: December 12, 2020, 12:39:10 pm »
Thanks. It's just like I said :)



14 clocks.

The only way to do it in 1 clock is by generating additional edges with delay lines. That was how a lot of stuff was done in the old days. I used those chips - the schottky ones used to run pretty hot.

Still, to do it in 14 clocks means they are doubling-up i.e. using both +ve and -ve edges, or something similar.
 

Offline Yansi

  • Super Contributor
  • ***
  • Posts: 3828
  • Country: 00
  • STM32, STM8, AVR, 8051
Re: ARM 32F4xx versus Cray 1
« Reply #13 on: December 12, 2020, 01:49:21 pm »
So what did you say? You're just selectively picking stuff.

What do you claim by that? That Cray1 could do faster float divide? Or that float divide can not be done single cycle? (of course, it can!)

Last but not least, divide is rarely used in practical applications.  Multiply-accumulate is what gets the most stuff done.
 
The following users thanked this post: Kjelt, Jacon

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 1664
  • Country: gb
  • Doing electronics since the 1960s...
Re: ARM 32F4xx versus Cray 1
« Reply #14 on: December 12, 2020, 08:38:47 pm »
How can float divide be done in a single cycle?
 

Offline Yansi

  • Super Contributor
  • ***
  • Posts: 3828
  • Country: 00
  • STM32, STM8, AVR, 8051
Re: ARM 32F4xx versus Cray 1
« Reply #15 on: December 13, 2020, 09:50:26 am »
Of course it can be. You just need to extend the pipeline enough, so that all other cycles are covered there.

And as I've said, divide is not what is really that much used or needed, it is not worth all the hassle to be pipelined so much.

And if you look closer, the Cortex-M4 can already complete some instructions out of order (and the Cortex-M7 is even superscalar), so if the compiler is clever, the divide will not cost 14 cycles but fewer: the remaining cycles overlap with other non-FPU instructions.
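
To make the latency versus throughput distinction concrete, here is a minimal sketch in Python. It assumes an idealised divider that accepts one new, independent operation per clock; the 14-cycle latency is the figure from this thread, while the function names and the fully pipelined divider itself are hypothetical (this is not how the M4 FPU is built).

```python
DIVIDE_LATENCY = 14  # cycles for one divide, figure taken from the thread


def cycles_unpipelined(n_ops, latency):
    """Each operation must finish before the next one starts."""
    return n_ops * latency


def cycles_pipelined(n_ops, latency):
    """A new independent operation enters every cycle; the first result
    appears after `latency` cycles, then one result per cycle."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1)


n = 100
serial = cycles_unpipelined(n, DIVIDE_LATENCY)
pipelined = cycles_pipelined(n, DIVIDE_LATENCY)
print(serial, pipelined, pipelined / n)
```

Each individual divide still takes 14 cycles from start to finish; only the average cost per divide approaches one cycle, and only when the operations do not depend on each other's results.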
 

Offline Doctorandus_P

  • Super Contributor
  • ***
  • Posts: 2068
  • Country: nl
Re: ARM 32F4xx versus Cray 1
« Reply #16 on: January 06, 2021, 11:31:38 am »
Have you done any research on the Cray1? For example:
https://en.wikipedia.org/wiki/Cray-1

It was by no means a general purpose computer. It was very good at some vector operations and a lot of programs (libraries) were probably hand optimized asm for that architecture.
 

Online westfw

  • Super Contributor
  • ***
  • Posts: 3716
  • Country: us
Re: ARM 32F4xx versus Cray 1
« Reply #17 on: January 07, 2021, 06:34:25 am »
I took (part of) a Cray-1 assembly language class, back in the mid-1980s.  Old-coot Stanford Prof was like "this is the only way to get maximum performance out of a Cray!"
One of the things I remember was that the main memory was highly interleaved (I recall 8 banks, but WP says up to 16), so if you organized your data correctly, you could access 512 bits worth of operands at essentially full speed (50ns memory, ~12ns CPU clock...)  That's a total memory bandwidth that would be considered pretty good even in a modern system...
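
A toy model of why data layout matters with interleaved banks, sketched in Python. The 16 banks and the 4-clock bank busy time (50ns memory divided by a roughly 12ns clock) come from the figures in this post; the function `cycles_for_stream` and the simple "one request per clock, stall if the bank is busy" rule are assumptions made for illustration, not a description of the real memory controller.

```python
def cycles_for_stream(n_words, stride, n_banks=16, bank_busy=4):
    """Clocks needed to issue n_words accesses at a fixed word stride.

    Consecutive word addresses map to consecutive banks (address mod n_banks).
    A bank that has just been accessed stays busy for bank_busy clocks.
    """
    free_at = [0] * n_banks  # clock at which each bank can accept an access
    clock = 0
    for i in range(n_words):
        bank = (i * stride) % n_banks
        start = max(clock, free_at[bank])  # wait if the bank is still busy
        free_at[bank] = start + bank_busy
        clock = start + 1                  # at most one access issued per clock
    return clock


for stride in (1, 4, 8, 16):
    print(stride, cycles_for_stream(64, stride))
```

With stride 1 or 4 every access finds its bank idle and 64 words take 64 clocks. Stride 8 bounces between two banks and takes about twice as long, and stride 16 hits the same bank every time, so the stream runs at memory speed rather than CPU speed.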
 

Offline Berni

  • Super Contributor
  • ***
  • Posts: 4106
  • Country: si
Re: ARM 32F4xx versus Cray 1
« Reply #18 on: January 07, 2021, 08:25:11 am »
It can be tricky directly comparing two very different architectures. One or the other may be faster depending on what you take as a metric.

For DSP computation performance a good benchmark is FFT.
Here is a FFT implementation on the Cray 1: https://www.ecmwf.int/file/24115/download?token=s6ilqSrl
So according to the table, compiled Fortran gets the following for a single FFT:
N32: 157us
N64: 302us
N128: 736us
N1024: 7477us
But if you calculate 128 FFTs in parallel to fill up its wide datapaths, and use the more efficient CAL (Cray Assembly Language) implementation, the average equivalent time per FFT is:
N32: 7us
N64: 15us
N128: 37us
N1024: 415us

And if we look at a modern ARM, say a Cortex-M4 at 180MHz in the form of an NXP Kinetis K66:
http://openaudio.blogspot.com/2016/09/benchmarking-fft-speed.html
N32: 40000 per second = 25 us
N64: 21227 per second = 47 us
N128: 9524 per second = 105 us

This is with a library that is assembler-optimised for ARM; with the generic, architecture-independent C implementation of KissFFT the speed is about 50% of that.

So if you take the first table then ARM is significantly faster, but if you take the second table Cray 1 is way faster. If you do the same test with fixed point FFT the results are again very different.
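
The ratios behind that conclusion can be worked out from the three tables; a small Python sketch using only the microsecond figures quoted in this post (the dictionary names are just for illustration, and N1024 is left out because there is no ARM figure for it here):

```python
# Time per FFT in microseconds, copied from the tables in this post.
cray_single_fortran = {32: 157, 64: 302, 128: 736}
cray_batched_cal = {32: 7, 64: 15, 128: 37}
arm_m4_180mhz = {32: 25, 64: 47, 128: 105}

for n in (32, 64, 128):
    arm_vs_single = cray_single_fortran[n] / arm_m4_180mhz[n]
    cray_vs_arm = arm_m4_180mhz[n] / cray_batched_cal[n]
    print(f"N{n}: ARM is {arm_vs_single:.1f}x faster than a single Fortran FFT, "
          f"batched CAL on the Cray is {cray_vs_arm:.1f}x faster than ARM")
```

So on these numbers the ARM wins by roughly 6x to 7x against the single Fortran FFT, and the batched Cray code wins by roughly 3x against the ARM, while running at less than half the clock rate.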
 
The following users thanked this post: I wanted a rude username

Offline jmelson

  • Super Contributor
  • ***
  • Posts: 2395
  • Country: us
Re: ARM 32F4xx versus Cray 1
« Reply #19 on: March 05, 2021, 03:12:09 am »
Quote from: peter-h
Both run at 150MHz and both have hardware floating point :)

The Cray 1 had four cores and probably lots of other goodies which optimised it for things like crypto, but what were they? Did it have custom microcode capability?

I believe the Cray 1 clock was 80 MHz, or some multiple of that.  It did not have "four cores".  It was a single processor, but had a number of features for processing multiple instructions at one time.  Maybe your four cores refers to having two adders and two multipliers that could all run in parallel.  There was no microcode at all, it was entirely hardwired.  One feature was the vector registers.  This allowed an array operation to be set up, with a source and destination address counter, that had "stride", so it could skip so many words each cycle.  This meant that an entire row of an array multiply could be set up and executed as a single instruction.  The operation would be decoded once, and then the vector register would perform that operation the number of times specified, while stepping through the addresses as required.  It could do a multiply-accumulate operation at 80 MFLOPS.
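
To show what "stride" buys you, here is a toy model in Python. It is only a sketch of the idea: the helper names (`vector_load`, `vector_multiply`, `vector_store`) are made up for illustration and do not correspond to actual Cray instructions; the 64-element limit reflects the length of a Cray 1 vector register.

```python
VECTOR_LENGTH = 64  # elements per vector register on the Cray 1


def vector_load(memory, base, stride, length):
    """Gather `length` words starting at `base`, skipping `stride` words each step."""
    assert length <= VECTOR_LENGTH
    return [memory[base + i * stride] for i in range(length)]


def vector_multiply(va, vb):
    """One 'instruction': the operation is specified once, applied to every element."""
    return [a * b for a, b in zip(va, vb)]


def vector_store(memory, base, stride, values):
    """Scatter the vector back to memory with the given stride."""
    for i, value in enumerate(values):
        memory[base + i * stride] = value


# A 4x4 matrix stored row by row in words 0..15, plus 4 spare words for the result.
memory = list(range(16)) + [0] * 4

column_1 = vector_load(memory, base=1, stride=4, length=4)  # walk down a column
row_2 = vector_load(memory, base=8, stride=1, length=4)     # walk along a row
product = vector_multiply(column_1, row_2)
vector_store(memory, base=16, stride=1, values=product)

print(column_1, row_2, product)
```

The point of the stride is that a column of a row-major array can be fetched as easily as a row, without any per-element address arithmetic in the program.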

The Cray I was not really designed for crypto work, but for things like computational fluid dynamics, chip simulation, mechanical finite element analysis, and something related to nuclear weapons simulation that is likely still classified.

Jon
 

Offline jmelson

  • Super Contributor
  • ***
  • Posts: 2395
  • Country: us
Re: ARM 32F4xx versus Cray 1
« Reply #20 on: March 05, 2021, 03:17:30 am »
Quote from: bdunham7
I believe it was the first implementation of a vector processor, sort of an early SSE/AVX.
No, the CDC Star (also at least partly designed by Seymour Cray) had rather similar vector registers.

Jon
 

Offline jmelson

  • Super Contributor
  • ***
  • Posts: 2395
  • Country: us
Re: ARM 32F4xx versus Cray 1
« Reply #21 on: March 05, 2021, 03:22:31 am »
Quote from: Yansi
So what did you say? You're just selectively picking stuff.

What do you claim by that? That Cray1 could do faster float divide? Or that float divide can not be done single cycle? (of course, it can!)
The Cray I did not have a divide instruction.  It was done by reciprocate and then multiply.  It did have instructions for this.
Quote
Last but not least, divide is rarely used in practical applications.  Multiply-accumulate is what gets the most stuff done.
Yes, that's why Cray decided to not implement divide.  He had a set of classic algorithms he used to decide what the Cray I needed to do fast, and what could be left out or done other ways.
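
For readers who have not seen divide done this way, here is a sketch of the general technique in Python: compute an approximate reciprocal, refine it with Newton-Raphson steps (which need only multiplies and subtracts), then multiply. This illustrates the idea only; the seed formula, the iteration count and the function names are my assumptions and are not claimed to match the Cray I instruction sequence.

```python
import math


def reciprocal(b, iterations=4):
    """Approximate 1/b using only multiply and subtract after a rough seed."""
    if b == 0:
        raise ZeroDivisionError("reciprocal of zero")
    # Split |b| into mantissa in [0.5, 1) and a power of two.
    mantissa, exponent = math.frexp(abs(b))
    # Linear first guess for 1/mantissa on [0.5, 1) (a common textbook seed).
    x = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    # Newton-Raphson: each step roughly doubles the number of correct bits.
    for _ in range(iterations):
        x = x * (2.0 - mantissa * x)
    # Undo the scaling and restore the sign.
    return math.copysign(math.ldexp(x, -exponent), b)


def divide(a, b):
    """a / b implemented as a * (1 / b)."""
    return a * reciprocal(b)


print(divide(355.0, 113.0), 355.0 / 113.0)
print(divide(1.0, -3.0), 1.0 / -3.0)
```

The attraction for a machine built around fast multipliers is that the refinement steps reuse the multiply hardware, so no separate divider is needed.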

Jon
 

Online Sal Ammoniac

  • Super Contributor
  • ***
  • Posts: 1492
  • Country: us
Re: ARM 32F4xx versus Cray 1
« Reply #22 on: March 02, 2022, 07:14:04 am »
Quote from: peter-h
I recall Apple having made a load of PR out of buying a Cray 1 for their software development department.

Yes, Apple bought a Cray to design Macs, and Seymour, not to be outdone, said that he used a Mac to design Crays.
Complexity is the number-one enemy of high-quality code.
 

Online hamster_nz

  • Super Contributor
  • ***
  • Posts: 2652
  • Country: nz
Re: ARM 32F4xx versus Cray 1
« Reply #23 on: March 02, 2022, 07:39:59 am »
Quote from: peter-h
Is anyone here familiar with the Cray 1 processor and whether it had any application specific features, or was it just fast for its time (150MHz in 1975 was fast - 10x faster than mainframes).

It was a vector processor. This is the document you want to read

https://pages.cs.wisc.edu/~markhill/restricted/cacm78_cray1.pdf
Gaze not into the abyss, lest you become recognized as an abyss domain expert, and they expect you keep gazing into the damn thing.
 

Online DiTBho

  • Super Contributor
  • ***
  • Posts: 1600
  • Country: gb
Re: ARM 32F4xx versus Cray 1
« Reply #24 on: March 02, 2022, 10:01:37 am »
I have a simulator here written by someone, but it never worked and I gave up on it years ago :o
 

