EEVblog Electronics Community Forum

Products => Computers => Programming => Topic started by: SiliconWizard on November 05, 2022, 08:42:37 pm

Title: Fun simple exercise: median value
Post by: SiliconWizard on November 05, 2022, 08:42:37 pm
Can you think of the most optimized way of writing a function that returns the median value of 3 numbers? (That has to be signed numbers, can be integers or floating point, as you prefer.)

The median value of 3 numbers is not the average - it's the number among the 3 that sits in the middle. For instance: { 2, -3, 7 } -> the median value is 2.

The idea to judge the result would be to give the resulting assembly (using https://godbolt.org/ for instance.) The language itself can be anything you like, but it obviously has to be a compiled language to be able to see the resulting assembly. You can use the optimization options that yield the best result (at least as far as you can tell, may not be fully obvious from assembly on some targets what would yield the shorter execution time.)

You could also write it directly in assembly for the target of your choice, and see if you can do it better than optimizing compilers.

Can you manage to do this without any branching? If so, on which target(s)?
Title: Re: Fun simple exercise: median value
Post by: MK14 on November 05, 2022, 11:19:45 pm
I'm not sure how well optimized it is, compared to other peoples efforts, or if it needs more test cases (i.e. is fully working).  If MIN/MAX instructions are used (by the optimizer), instead of branches and/or intrinsics are used for MIN/MAX, if architecture supports it.  Then it could be branch free.

Code: [Select]
#include <stdio.h>
#include <stdlib.h>

#define max(a,b) ((a) > (b) ? (a) : (b))
#define min(a,b) ((a) < (b) ? (a) : (b))

int median(int x, int y, int z) {
  return min(max(x,z),min(max(x,y),max(y,z)));
}

int main()
{
  printf("2,-3,7 Answer = %d\n", median(2,-3,7));
  printf("-3,7,2 Answer = %d\n", median(-3,7,2));
  printf("7,2,-3 Answer = %d\n", median(7,2,-3));
  printf("-3, -3, 7 Answer = %d\n", median(-3,-3,7));
}

Results:

Code: [Select]
2,-3,7 Answer = 2
-3,7,2 Answer = 2
7,2,-3 Answer = 2
-3, -3, 7 Answer = -3

EDIT: The (X86_64) -O3 output, seems to be branch free.  I've tried to isolate, only the function itself.

Code: [Select]
median:
 cmp    edx,esi
 mov    eax,esi
 cmovge eax,edx
 cmp    edx,edi
 cmovl  edx,edi
 cmp    eax,edx
 cmovg  eax,edx
 cmp    esi,edi
 cmovl  esi,edi
 cmp    eax,esi
 cmovg  eax,esi
 ret   
Title: Re: Fun simple exercise: median value
Post by: SiliconWizard on November 06, 2022, 12:54:21 am
This one isn't bad. It is branch free indeed.

I have come up with this, which does a bit better, both for int and double:
(I've made it "generic" to make it easier to "instantiate" for different types.)
Code: [Select]
#define SWAPIFGREATER(a, b)         if ((a)>(b)) { typeof(a) temp = (a); (a) = (b); (b) = temp; }
#define MEDIAN3FUNC(name, type)     type name(type x1, type x2, type x3)

MEDIAN3FUNC(Median3i, int)
{
    SWAPIFGREATER(x1, x2)
    SWAPIFGREATER(x2, x3)
    SWAPIFGREATER(x1, x2)

return x2;
}
Code: [Select]
Median3i:
        cmp     edi, esi
        mov     eax, esi
        cmovg   eax, edi
        cmovl   esi, edi
        cmp     eax, edx
        cmovge  eax, edx
        cmp     esi, eax
        cmovg   eax, esi
        ret

It's basically a bubble sort, taking the middle value in the end. That it does a bit better may be due to the fact that the 3 steps are identical which may make it easier for the optimizer to determine a pattern.

Used for double, it gets even better due to the use of min/max FP instructions:
Code: [Select]
Median3d2:
        movapd  xmm3, xmm0
        maxsd   xmm3, xmm1
        minsd   xmm1, xmm0
        minsd   xmm2, xmm3
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret

Your own version in double yields this:
Code: [Select]
median:
        movapd  xmm3, xmm0
        maxsd   xmm3, xmm2
        maxsd   xmm0, xmm1
        maxsd   xmm1, xmm2
        minsd   xmm0, xmm1
        minsd   xmm3, xmm0
        movapd  xmm0, xmm3
        ret
Which is very similar, but with one more instruction. Again I suspect here the gain for mine is due to using 3 times the same pattern.
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 06, 2022, 04:51:48 am
I don't like relying on a specific version of a specific compiler, so I'll go straight into la-la land:
Code: [Select]
#define DEFINE_MEDIAN3(name, type)  \
    type name(type a, type b, type c) { return ((type [5]){b, c, a, c, b})[((a>b)<<1) + (a>c) + (b>c)]; }
DEFINE_MEDIAN3(Median3i, int)

It's just the array form of the comparison tree,
Code: [Select]
#define DEFINE_MEDIAN3(name, type) \
    type name(type a, type b, type c) { \
    const int  ab = (a > b), ac = (a > c), bc = (b > c); \
    return !(ab ^ bc) ? b : (ab ^ ac) ? a : c; }
DEFINE_MEDIAN3(Median3i, int)

If we explore these (and what MK14 and SiliconWizard posted above), we'll notice how different code clang and gcc generate.  The last variant above would be useful when compiling to rv32gc using clang, for example.  However, we notice that while the conditional swap tree and the minimum comparisons should yield very similar ASTs, the compilers generate completely different code.

Our compilers still optimize common code best.  Swap-if-pair-in-wrong-order is particularly common (although gcc still doesn't use cmovcc for this), so it is not a surprise that clang compiles the best code for the three-swap bubble sort; it is very common.  In particular, many quicksort implementations use median-of-three to select the pivot.



To explain the above two functions, consider the three-element bubble sort.  The three tests, the resulting median, and the expressions I used are
    a>b a>c b>c median !((a>b)^(b>c)) ((a>b)^(a>c))
     0   0   0    b           1             0
     0   0   1    c           0             0
     0   1   0    -
     0   1   1    a           0             1
     1   0   0    a           0             1
     1   0   1    -
     1   1   0    c           0             0
     1   1   1    b           1             0
where the two dashed cases are logically impossible and thus irrelevant: If a<=b and b<=c, then a>c is impossible; and if a>b and b>c, then a>c is impossible.

Anyway, this just shows that when you want the median, it is simplest to sort it and pick the middle element.  For example, when I do microbenchmarking, I often use qsort() to sort the durations of the runs (first moving the very first one to just past the array, since sometimes the first run time can be informative), and then use several of the entries, not just the median in the middle.  That lets me say things like "at least 75% of cases, the code takes duration[3*N/4] or less".  Of particularly useful are the 90% and 95% cases, in my opinion.

There is quickselect (https://en.wikipedia.org/wiki/Quickselect), of course, when the array is large and you don't want to sort all of it.
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 09, 2022, 04:56:49 am
Did I kill the topic? :o

Sorry...  :-[
Title: Re: Fun simple exercise: median value
Post by: MK14 on November 09, 2022, 03:20:13 pm
Did I kill the topic? :o

Sorry...  :-[

There are complications, for whatever might be the most optimal solution, gradually coming up, in theory, for the future.

Currently, and assuming we are talking about a high end, modern, full sized CPU, such as Desktops and Servers.  Because the CPU can do so many multiple instructions, in the same clock-cycle.  The fastest (wall clock) algorithm, may not actually be the most optimal, in terms of minimization of the operations required.
For example, because the shortest/quickest LOOKING algorithm in theory, may have dependencies, which hinder the out of order execution, to need a number of extra cycles.  Similarly, the issue can be with branches (predictors) and/or caches.

When/if FPGA (on top of the conventional CPU, or similarly) equipped CPUs become mainstream.  They also would/might allow, the creation of single instructions, which can perform the median{3} operation, in just a cycle (or few).

Smarter memories (content addressable memory, possibly with extra enhancements), talked about ages ago, may eventually become a mainsteam reality.  Which could be smart enough to ask/request from them, the median value (as well as the smallest, biggest, or ones, with a specific content filed match), automatically and at very high speed.

If you are a big enough customer.  I've heard (strong rumor) that you can request that Intel add your new instruction (median3, extended instruction), as either part of the published new, future instruction sets, or as a hidden feature, which can be made available for that special/big customer.

If a significant number of things need median3, then it could get added, anyway.  Sooner or later.
Title: Re: Fun simple exercise: median value
Post by: westfw on November 15, 2022, 12:54:53 am
Quote
    SWAPIFGREATER(x1, x2)
    SWAPIFGREATER(x2, x3)
    SWAPIFGREATER(x1, x2)
On a PDP10, I think that'd become:
Code: [Select]
camle x1, x2   ; compare and skip is less or equal
  exch x1,x2   ; exchange
camle x2, x3
  exch x2, x3
camle x1, x2
 exch x1, x2
;; result in in x2
I'm wondering whether you could squeeze another instruction or two by stacking the skip instructions, ala "TRCEs in Triples" (GLS: "My Favorite Hacks"), but I don't actually have a PDP10 to try it out on, so I'll pass.
Title: Re: Fun simple exercise: median value
Post by: MK14 on November 15, 2022, 01:29:14 am
Quote
    SWAPIFGREATER(x1, x2)
    SWAPIFGREATER(x2, x3)
    SWAPIFGREATER(x1, x2)
On a PDP10, I think that'd become:
Code: [Select]
camle x1, x2   ; compare and skip is less or equal
  exch x1,x2   ; exchange
camle x2, x3
  exch x2, x3
camle x1, x2
 exch x1, x2
;; result in in x2
I'm wondering whether you could squeeze another instruction or two by stacking the skip instructions, ala "TRCEs in Triples" (GLS: "My Favorite Hacks"), but I don't actually have a PDP10 to try it out on, so I'll pass.

You can run it (PDP-10, such as SIMH) using emulators, I believe.  Complete with TOPS-10 or TOPS-20, the OS.  With its slow by todays standard, 1 MIPs/MFlops performance, approximately.
At least you wouldn't have to share that mainframe, with a big number of other users (typically).

I'm surprised, such an old archetecture, has such powerful instructions.
Title: Re: Fun simple exercise: median value
Post by: SiliconWizard on November 15, 2022, 02:07:42 am
Quote
    SWAPIFGREATER(x1, x2)
    SWAPIFGREATER(x2, x3)
    SWAPIFGREATER(x1, x2)
On a PDP10, I think that'd become:
Code: [Select]
camle x1, x2   ; compare and skip is less or equal
  exch x1,x2   ; exchange
camle x2, x3
  exch x2, x3
camle x1, x2
 exch x1, x2
;; result in in x2
I'm wondering whether you could squeeze another instruction or two by stacking the skip instructions, ala "TRCEs in Triples" (GLS: "My Favorite Hacks"), but I don't actually have a PDP10 to try it out on, so I'll pass.

You can run it (PDP-10, such as SIMH) using emulators, I believe.  Complete with TOPS-10 or TOPS-20, the OS.  With its slow by todays standard, 1 MIPs/MFlops performance, approximately.
At least you wouldn't have to share that mainframe, with a big number of other users (typically).

I'm surprised, such an old archetecture, has such powerful instructions.

You should have a look at docs for the Apollo computer.
Title: Re: Fun simple exercise: median value
Post by: MK14 on November 15, 2022, 02:15:31 am
You should have a look at docs for the Apollo computer.

Thanks!   :)

I enjoyed seeing (quite a while ago) the Apollo computer details (all Nor-gates, if I remember correctly, very clever architecture, especially for the time.  Especially well suited for real-time software.).  In a surprisingly small box, for the time period.

But now you mention it.  I should revisit it.

If I remember right, the PDP-8, has a rather/very simple architecture.  Because it was all transistor, and any extra complexity, would involve more of the (very, at that time) expensive transistors.  Hence, why they tried to keep it so simple.
Title: Re: Fun simple exercise: median value
Post by: westfw on November 15, 2022, 06:35:51 am
Quote
I'm surprised, such an old archetecture, has such powerful instructions.
Not so powerful, really.  "skip" instructions for comparisons weren't all that rare, and the PDP10 has ... quite a full set CAMxx is the "compare AC (one of 16) with memory (registers also addressable as memory)" set, and all the possible choices of when to skip are there (including always and never.)  It's not like the skips were appreciably faster than the jump instructions (although the flag-based conditional jumps were rarely used, except for the nop "JFCL" (Jump on flags and clear them, except: no flags, and don't jump) (the fastest nop on KA and KL versions.)And as you said, all at about 1 MIPS.  Wonderfully regular instructions, though - the joy of a 9bit opcode!

Quote
You can run it (PDP-10, such as SIMH) using emulators
I'm aware.  A relatively modest PC running one of the emulators is apparently "many" times faster than the fastest PDP10 that DEC ever shipped.  Although I'm not sure how the "relatively intelligent" pager and cache and such play into that.
I have been resisting. (successfully.  For a long time now!)  The past is in the past.

Title: Re: Fun simple exercise: median value
Post by: MK14 on November 15, 2022, 07:06:46 am
I'm aware.  A relatively modest PC running one of the emulators is apparently "many" times faster than the fastest PDP10 that DEC ever shipped.  Although I'm not sure how the "relatively intelligent" pager and cache and such play into that.
I have been resisting. (successfully.  For a long time now!)  The past is in the past.

I also, don't like the potential speed error, which I would find rather/very annoying.

One possible solution, is to run it on something like a Raspberry PI 2 (which is fairly often possible).  That then has a tendency, to force the emulator, to run at something, hopefully, much closer to (what was), the real-life (realistic) speed.  Because the Raspberry PI 2, is so much weaker than the later Raspberry PI 4, and especially modern day PCs.

There is a different mainframe computer, which I'm keen to try (in emulation).  Which runs Multics (  https://en.wikipedia.org/wiki/Multics  ).  But by running it on the Raspberry PI 2, the emulator, runs vaguely close (or better) to the original mainframe computers speed.  According to people, who use to regularly use the real mainframe computer, Multics runs on.
When using the emulator for it ( e.g.  https://multicians.org/simulator.html  ).
According to a benchmark program, which can be run on it.

It is educationally interesting, because Multics came out before Unix, and may have created some of the OS concepts, that more modern operating systems, copied, in the future.

Also, the emulator, comes with quite a selection of included software (compilers), for quite a lot of different programming languages (of that era).

I like emulators.  Because they can let me play with various vintage computers, such as mainframes.
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 15, 2022, 08:03:35 am
The talk above about conditional moves and such caused me to explore a bit more, and I found out that
Code: [Select]
#define DEFINE_MEDIAN3(name, type) \
    type name(const type a, const type b, const type c) { \
        const type a1 = (a < b) ? a : b, \
                   b1 = (a < b) ? b : a; \
        const type c1 = (a1 < c) ? c : a1; \
        return (b1 < c1) ? b1 : c1; \
    }
DEFINE_MEDIAN3(Median3d, double)
DEFINE_MEDIAN3(Median3f, float)
DEFINE_MEDIAN3(Median3l, long)
DEFINE_MEDIAN3(Median3i, int)
DEFINE_MEDIAN3(Median3c, char)
DEFINE_MEDIAN3(Median3s, short)
generates near-optimal conditional move assembly as previously discussed, for x86-64 on both gcc and clang.  char and short tends to generate additional code on older versions, but clang 9.0 to 15.0 generates the optimal (AFAICS) code:
Code: [Select]
Median3d:                               # @Median3d
        movapd  xmm3, xmm0
        minsd   xmm3, xmm1
        maxsd   xmm1, xmm0
        maxsd   xmm2, xmm3
        minsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret
Median3f:                               # @Median3f
        movaps  xmm3, xmm0
        minss   xmm3, xmm1
        maxss   xmm1, xmm0
        maxss   xmm2, xmm3
        minss   xmm1, xmm2
        movaps  xmm0, xmm1
        ret
Median3l:                               # @Median3l
        cmp     rdi, rsi
        mov     rax, rsi
        cmovle  rax, rdi
        cmovl   rdi, rsi
        cmp     rax, rdx
        cmovl   rax, rdx
        cmp     rdi, rax
        cmovle  rax, rdi
        ret
Median3i:                               # @Median3i
        cmp     edi, esi
        mov     eax, esi
        cmovle  eax, edi
        cmovl   edi, esi
        cmp     eax, edx
        cmovl   eax, edx
        cmp     edi, eax
        cmovle  eax, edi
        ret
Median3c:                               # @Median3c
        cmp     dil, sil
        mov     eax, esi
        cmovle  eax, edi
        cmovl   edi, esi
        cmp     al, dl
        cmovl   eax, edx
        cmp     dil, al
        cmovle  eax, edi
        ret
Median3s:                               # @Median3s
        cmp     di, si
        mov     eax, esi
        cmovle  eax, edi
        cmovl   edi, esi
        cmp     ax, dx
        cmovl   eax, edx
        cmp     di, ax
        cmovle  eax, edi
        ret
GCC 8.5 to 12.2 tends to generate the 9-instruction form with conditional jumps for floats and doubles, with -O2 generating better code (fewer jumps) than -Os:
Code: [Select]
Median3d:
        comisd  xmm1, xmm0
        ja      .L2
        movapd  xmm3, xmm1
        movapd  xmm1, xmm0
        movapd  xmm0, xmm3
.L2:
        maxsd   xmm2, xmm0
        minsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret
Median3f:
        comiss  xmm1, xmm0
        ja      .L7
        movaps  xmm3, xmm1
        movaps  xmm1, xmm0
        movaps  xmm0, xmm3
.L7:
        maxss   xmm2, xmm0
        minss   xmm1, xmm2
        movaps  xmm0, xmm1
        ret
Median3l:
        cmp     rsi, rdi
        mov     rax, rdi
        cmovle  rax, rsi
        cmp     rax, rdx
        cmovl   rax, rdx
        cmp     rsi, rdi
        cmovl   rsi, rdi
        cmp     rax, rsi
        cmovg   rax, rsi
        ret
Median3i:
        cmp     esi, edi
        mov     eax, edi
        cmovle  eax, esi
        cmp     eax, edx
        cmovl   eax, edx
        cmp     esi, edi
        cmovl   esi, edi
        cmp     eax, esi
        cmovg   eax, esi
        ret
Median3c:
        cmp     sil, dil
        mov     eax, edi
        cmovle  eax, esi
        cmp     al, dl
        cmovl   eax, edx
        cmp     sil, dil
        cmovl   esi, edi
        cmp     al, sil
        cmovg   eax, esi
        ret
Median3s:
        cmp     si, di
        mov     eax, edi
        cmovle  eax, esi
        cmp     ax, dx
        cmovl   eax, edx
        cmp     si, di
        cmovl   esi, edi
        cmp     ax, si
        cmovg   eax, esi
        ret
It looks like GCC optimization "fails" because it switches too early from conditional loads to conditional jumps; that clang has better optimization for this kind of comparison trees.
Title: Re: Fun simple exercise: median value
Post by: gf on November 15, 2022, 10:54:48 am
It looks like GCC optimization "fails" because it switches too early from conditional loads to conditional jumps; that clang has better optimization for this kind of comparison trees.

If called in an innermost loop (e.g. median filter) and if each loop iteration randomly happens to take a different branch, then branch misprediction can have a significant performance impact, at least on superscalar processors with branch prediction. Then the branchless version of the code is likely preferable.
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 15, 2022, 01:19:21 pm
It looks like GCC optimization "fails" because it switches too early from conditional loads to conditional jumps; that clang has better optimization for this kind of comparison trees.
If called in an innermost loop (e.g. median filter) and if each loop iteration randomly happens to take a different branch, then branch misprediction can have a significant performance impact, at least on superscalar processors with branch prediction. Then the branchless version of the code is likely preferable.
True.  The optimization "failure" is specifically that instead of compiling
    type  a1 = (a < b) ? a : b,  b1 = (a < b) ? b : a;
as if it was
    type  a1 = min(a, b),  b1 = max(a, b);
it compiles as if it was
    type  a1, b1;
    if (a < b) {
        a1 = a; b1 = b;
    } else {
        a1 = b; b1 = a;
    }
i.e. the comparison pairs are not simplified to min/max (here; they are later on!), and two comparisons are considered more costly than two assignments (explaining why the generated code combines them to a single conditional jump with assignment/swap in one branch).  Thus, I think that the problem is that on x86-64, GCC considers two comparisons more costly than two assignments.  Apparently, its optimization facilities do not take conditional move only -branches into account, and instead assumes that each comparison will result in a costly branch, and tries to consolidate the comparisons.



I really don't like fighting against the compiler, so I think that something like the following would be useful on x86-64:
Code: [Select]
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__)) && !defined(NO_INLINE_ASM)

__attribute__((__const__))
static inline double Median3d(double a, double b, double c)
{
    asm volatile ( "minsd   %[d], %[b]\n\t"
                   "maxsd   %[b], %[a]\n\t"
                   "maxsd   %[c], %[d]\n\t"
                   "minsd   %[b], %[c]"
                 : [d] "+&x" (a), [b] "+&x" (b), [c] "+x" (c)
                 : [a] "x" (a));
    return b;
}

__attribute__((__const__))
static inline float Median3f(float a, float b, float c)
{
    asm volatile ( "minss   %[d], %[b]\n\t"
                   "maxss   %[b], %[a]\n\t"
                   "maxss   %[c], %[d]\n\t"
                   "minss   %[b], %[c]"
                 : [d] "+&x" (a), [b] "+&x" (b), [c] "+x" (c)
                 : [a] "x" (a));
    return b;
}

#else

static inline double Median3d(const double a, const double b, const double c)
{
    const double  a1 = (a < b) ? a : b, b1 = (a < b) ? b : a;
    const double  c1 = (a1 < c) ? c : a1;
    return (b1 < c1) ? b1 : c1;
}

static inline float Median3f(const float a, const float b, const float c)
{
    const float  a1 = (a < b) ? a : b, b1 = (a < b) ? b : a;
    const float  c1 = (a1 < c) ? c : a1;
    return (b1 < c1) ? b1 : c1;
}

#endif
In other words, on x86-64 with GCC and Clang, the function is implemented as an inlineable extended asm (with inline only there as a reminder for us humans), unless NO_INLINE_ASM is defined as a macro.  Otherwise, the C equivalent is used.

As to the extended asm spec, [d] and [b] are modified before all parameters have been used, so they're read-write earlyclobber SSE registers (+&x), [c] is modified but after all parameters have been used, so it's just read-write SSE register (+x); and [a] is a read-only SSE register (x).  The initial value of [d] is a, others per their names.  The instructions do not modify any flags or registers besides the three outputs ([d], [b], [c]), so no clobbers are needed.
Title: Re: Fun simple exercise: median value
Post by: MK14 on November 15, 2022, 04:04:26 pm
There is a similar, but DIFFERENT specification, thread.  That might have some tips, for further speeding this thing up.  Because it is for finding the MINIMUM of three values, rather than their median.

But two possible speedups, it still seems to offer, are possibly useful here.  There are SSE (suitable SSE version number or AVX etc) instructions.  Which can do multiple, min (or maxes, I presume), in one go.  It might be possible to rearrange things, to use them, to possibly speed things up and reduce or eliminate any branches.  It also has a tip, for making sure GCC produces better optimal code, by making sure the machine architecture type, is set correctly, rather than leaving it to defaults.  Which may or may not apply, to the later versions of GCC, we are discussing.
Anyway, it seems to make an interesting read, if you are into that sort of thing.

https://stackoverflow.com/questions/2039730/fastest-way-to-find-out-minimum-of-3-numbers
Title: Re: Fun simple exercise: median value
Post by: gf on November 15, 2022, 04:18:45 pm
I just wonder, do gcc and clang also support the _mm...() intrinsics from Intel?
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 15, 2022, 05:20:34 pm
I just wonder, do gcc and clang also support the _mm...() intrinsics from Intel?
Yes, after you include <immintrin.h>.  It is provided by the compiler, not by the C library.  (The header file implements the Intel intrinsics using compiler-specific built-ins and idioms.)

Basic arithmetic on vectors (using the vector_size variable/type attribute) does work portably without any header files.  For example,
    typedef  float  vec4f __attribute__((vector_size (4*sizeof (float))));
defines a four-component float vector.  Component-wise addition, subtraction, multiplication, and division is supported, regardless of whether the current architecture supports it in hardware; but if it does support some form of SIMD for the type, then the compiler will use it.  Given
    vec4f  a = { 1.0f, 2.0f, 3.0f, 4.0f }, b = { 5.0f, 6.0f, 7.0f, 8.0f };
    vec4f  sum = a + b, prod = a * b;
you have sum[0] == 6.0f, sum[3] == 12.0f, prod[0] == 5.0f, prod[1] == 12.0f, prod[3] == 32.0f, and so on. 
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 15, 2022, 06:45:28 pm
There is a similar, but DIFFERENT specification, thread.  That might have some tips, for further speeding this thing up.
For ARM Neon, perhaps in theory; for SSE/AVX, not really: the two min and two max operations are as good as it gets, although as there are multiple ways of writing them, there might be one that gets executed faster than others because of the register access patterns.  Note that the four min/max operations cannot be parallelized except for one min-max pair, since the two others depend on the results of that pair.  ARM Neon has minimum and maximum across vector, but SSE/AVX does not, and neither has "sort across vector".

If you use SSE/AVX or ARM Neon, you can easily do e.g. four median-of-three operations in parallel,
Code: [Select]
#if defined(__x86_64__)

typedef  float  vec4f __attribute__((vector_size (4*sizeof (float))));
#define  vec4f_min  __builtin_ia32_minps
#define  vec4f_max  __builtin_ia32_maxps

#elif defined(__arm__) || defined(__aarch64__)

#include <arm_neon.h>
typedef  float32x4_t  vec4f;
#define  vec4f_min    vminq_f32
#define  vec4f_max    vmaxq_f32

#else
#error Only x86-64 and ARM NEON supported.
#endif

vec4f median3f4(vec4f a, vec4f b, vec4f c)
{
    const vec4f  a1 = vec4f_min(a, b),
                 b1 = vec4f_max(a, b);
    const vec4f  c1 = vec4f_max(a1, c);
    return vec4f_min(b1, c1);
}
so that m = median3f4(a, b, c); yields m[0] with the median of a[0], b[0], and c[0]; m[1] with the median of a[1], b[1], and c[1]; m[2] with the median of a[2], b[2], and c[2]; and m[3] with the median of a[3], b[3], and c[3].

Note that with the vec4f type, addition, subtraction, multiplication, and division is component-wise.  With vec4f a, b, c; the expression c = a OP b; is equivalent to c = (vec4f){ a[0] OP b[0], a[1] OP b[1], a[2] OP b[2], a[3] OP b[3] }; for + - * / OPs at least.  For other stuff – reciprocals, approximate reciprocals, square roots, and minimum/maximum as above –, you need intrinsics.
(This also means that it is easier to implement your own functions as SFMD – single function, multiple data –, applying the same function to datasets in parallel, with each dataset in one vector variable, containing 2, 4, 8, or 16 members.)

You can find GCC vector extension details here (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html); here for SSE/AVX (https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html), and here for ARM (https://arm-software.github.io/acle/main/acle.html).
Title: Re: Fun simple exercise: median value
Post by: MK14 on November 15, 2022, 07:46:00 pm
There is a similar, but DIFFERENT specification, thread.  That might have some tips, for further speeding this thing up.
For ARM Neon, perhaps in theory; for SSE/AVX, not really: the two min and two max operations are as good as it gets, although as there are multiple ways of writing them, there might be one that gets executed faster than others because of the register access patterns.  Note that the four min/max operations cannot be parallelized except for one min-max pair, since the two others depend on the results of that pair.  ARM Neon has minimum and maximum across vector, but SSE/AVX does not, and neither has "sort across vector".

If you use SSE/AVX or ARM Neon, you can easily do e.g. four median-of-three operations in parallel,
Code: [Select]
#if defined(__x86_64__)

typedef  float  vec4f __attribute__((vector_size (4*sizeof (float))));
#define  vec4f_min  __builtin_ia32_minps
#define  vec4f_max  __builtin_ia32_maxps

#elif defined(__arm__) || defined(__aarch64__)

#include <arm_neon.h>
typedef  float32x4_t  vec4f;
#define  vec4f_min    vminq_f32
#define  vec4f_max    vmaxq_f32

#else
#error Only x86-64 and ARM NEON supported.
#endif

vec4f median3f4(vec4f a, vec4f b, vec4f c)
{
    const vec4f  a1 = vec4f_min(a, b),
                 b1 = vec4f_max(a, b);
    const vec4f  c1 = vec4f_max(a1, c);
    return vec4f_min(b1, c1);
}
so that m = median3f4(a, b, c); yields m[0] with the median of a[0], b[0], and c[0]; m[1] with the median of a[1], b[1], and c[1]; m[2] with the median of a[2], b[2], and c[2]; and m[3] with the median of a[3], b[3], and c[3].

Note that with the vec4f type, addition, subtraction, multiplication, and division is component-wise.  With vec4f a, b, c; the expression c = a OP b; is equivalent to c = (vec4f){ a[0] OP b[0], a[1] OP b[1], a[2] OP b[2], a[3] OP b[3] }; for + - * / OPs at least.  For other stuff – reciprocals, approximate reciprocals, square roots, and minimum/maximum as above –, you need intrinsics.
(This also means that it is easier to implement your own functions as SFMD – single function, multiple data –, applying the same function to datasets in parallel, with each dataset in one vector variable, containing 2, 4, 8, or 16 members.)

You can find GCC vector extension details here (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html); here for SSE/AVX (https://gcc.gnu.org/onlinedocs/gcc/x86-Built-in-Functions.html), and here for ARM (https://arm-software.github.io/acle/main/acle.html).

That's very interesting, if the Arm Neon can do Min or Max, across a vector.  I can imagine some uses, for such a thing.  Such as perhaps, a highly optimized sort/select, and other things.

In practice (i.e. with many real life programs).  Being able to perform the median3, on 4 sets of numbers, at a time.  Could be useful for speeding up, some real life applications.  Because not all of them, would be forced, to only calculate one median3 at a time, occasionally.

Also, (as I understand it), something done as a one-off (or very rarely), where the code only does something once, rather than in a loop.  Could find that it is the cache loading time, which dominates the wall clock, real life execution time.

I.e. Simply compressing the size of the code, even if the theoretical cycle count, may increase.  Could result in the fastest possible wall clock, real life speeds.  N.B. Only when it runs, once only, or very rarely.

Some people complicate/muddy this issue.  By recommending, something on the lines, that the benchmark results, are only counted.  Once the program has run for the first time, and hence been loaded into the caches.

I suppose, testing and experimentation results.  Will give a better idea, as to what programing practices are best, for particular applications.

Many are simply written in an interpreted/script language anyway, such as Python or Javascript.  So, extreme optimizations, is not such a popular subject area these days, I guess.  So many things, easily run, fast enough, that speed is not really an issue. 
Title: Re: Fun simple exercise: median value
Post by: MK14 on November 15, 2022, 08:03:00 pm
ARM Neon has minimum and maximum across vector, but SSE/AVX does not, and neither has "sort across vector".

I've had a somewhat quick look, and don't seem to have found such instructions (using Arm Neon).  Have you got an instruction name or link, as I find that very interesting?

I have found some other instructions, such as vminq_f32 (which was in your post, or something similar to it was).  But, that apparently, takes a pair of 4 way vectors, and creates a new vector (return value), which comprises of the 4 minimum values, between the two supplied, 4 way vectors.
I.e. It seems to be between 2 vectors, pair by pair, rather than across an entire vector, to produce a single minimum value.

EDIT: It is getting confusing.  A number of people seem to be saying, it is ONLY vector to vector.

But the Arm manual seems to confuse the issue, by saying:
Quote
Minimum across vector

But the description, seems to point out, that it actually is a value pair (between each place in the supplied vectors), rather than being genuinely across a single vector, to produce a single, overall minimum value.

[/s]
Quote
reads each pair of adjacent vector elements from the concatenated vector, writes the smaller of each pair of values into a vector, and writes the vector to the destination

Source:
https://developer.arm.com/architectures/instruction-sets/intrinsics/vminv_f32

EDIT2:
Sorry!
Yes, I managed to find the instruction(s).  It apparently DOES do it.  I'm impressed.  It sounds powerful.
There seem to be places on the internet that (confusingly), say it can't  But I think, the Neon version(s), which do it, are fairly recent.  So, those sources were probably too old.

https://developer.arm.com/architectures/instruction-sets/intrinsics/vminvq_f32

vminvq_f32

Quote
Floating-point Minimum across Vector. This instruction compares all the vector elements in the source SIMD&FP register, and writes the smallest of the values as a scalar to the destination SIMD&FP register. All the values in this instruction are floating-point values.
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 16, 2022, 10:35:46 am
Yup, it is quite complicated to find out exactly what a specific microcontroller happens to support.

For example, Teensy 4.0 and 4.1 use the NXP i.MX RT1062 processor, an ARM Cortex M7 (actually M7F, since it has single and double precision FP support), with the DSP extensions.  Here is my current understanding:

ARM Cortex M7 uses ARMv7E-M architecture.

The ARMv7E-M instruction set is documented here (https://developer.arm.com/documentation/ddi0403/d/Application-Level-Architecture/The-ARMv7-M-Instruction-Set).  The standard floating-point data processing instructions are listed here (https://developer.arm.com/documentation/ddi0403/d/Application-Level-Architecture/The-ARMv7-M-Instruction-Set/Floating-point-data-processing-instructions?lang=en), and the DSP extensions here (https://developer.arm.com/documentation/ddi0403/d/Application-Level-Architecture/The-ARMv7-M-Instruction-Set/Data-processing-instructions/Parallel-addition-and-subtraction-instructions--DSP-extension?lang=en).

As a simple overview, the DSP extensions are 8-, 16-, and 32-bit signed modulo or saturating, or unsigned saturating, modulo, or halving, additions and subtractions, with optional swapping of 16-bit units within each 32-bit word.  The FP support is non-vectorized, i.e. single instruction single data and not SIMD, but includes rounding and conversion, absolute value, comparisons, add, subtract, multiply, divide, fused multiply add/subtract, fused negate multiply add/subtract, multiply add/subtract, multiply add/subtract negate, multiply negate, and square root.
No Neon or Helium.  For GCC (https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html), the proper options for this one are -mfloat-abi=hard -march=armv7e-m+fp.dp -mtune=cortex-m7.

In the GCC ARM options (https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html), we can also find architectures where Neon or Helium = MVE might be implemented. For example, Neon seems to only be found in ARMv7-a, i.e. on Cortex-A cores.  Cortex-M55 is ARMv8.1-M, and can support Helium; all ARMv8.1-M can.  The details of it are in ARMv8-M Architecture Reference Manual (https://developer.arm.com/documentation/ddi0553/latest), and the interesting instruction for "maximum across vector" is VMAXNMV (or VMAXNMAV for absolute values).  So, if an ARM core supports Helium (or M-Profile Vector Extension, MVE), it can do "maximum across vector" and even "maximum magnitude across vector" in a single instruction, loading the maximum (and similarly minimum) to a general-purpose register from the vector register.

But boy, is this tedious to walk through. So many names and extensions, it is drudge work to dredge all the details up!
Title: Re: Fun simple exercise: median value
Post by: SiliconWizard on November 16, 2022, 06:29:48 pm
If you think ARM is fragmented with too many extensions, you should have a look at RISC-V processors. ;D
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 17, 2022, 07:05:46 am
If you think ARM is fragmented with too many extensions, you should have a look at RISC-V processors. ;D
Funnily enough, I do not (think that ARM or RISC-V is fragmented with too many extensions), not even at the emotive/instinctual level.  I really do consider it only a documentation issue, and do not mind the actual fragmentation/extensions at all.  To me, it just means there is a wider palette to choose a tool from. :-+
Title: Re: Fun simple exercise: median value
Post by: westfw on November 17, 2022, 10:58:41 am
Quote
[architecture extension fragmentation is] only a documentation issue
I would be a bit happier if vendors (either IP or Chip, or both) were better at documenting it than they seem to be.  Having an instruction set reference manual that doesn't actually say which instructions are in which extension is pretty awful.  (the ARMv7-m ARM was pretty good about identifying which instructions were part of the DSP or floating point extension.  But the ARMv8-M ARM lumps together "baseline" ("v6m compatible") and "Main extension" (v7m compatible) cores, and isn't very careful about saying which instructions are in which.  (SOMETIMES it says "Main Extension Only", but other instructions it's a lot less clear.  CBZ, CBNZ, LDREX... were not supported by v6m - have the actually been added to the v8m baseline?  I can't tell...))
Title: Re: Fun simple exercise: median value
Post by: Nominal Animal on November 17, 2022, 11:57:04 am
I would be a bit happier if vendors (either IP or Chip, or both) were better at documenting it than they seem to be.
Me too, westfw, absolutely!

This reminds me of the time I studied calculus at Uni (Mathematical Methods for Physicists), and instead of showing how to choose a solution approach (separation of variables, direct integration, a particularly useful variable substitution, etc.) depending on the properties of the differential equation at hand, the lecturer proved the solutions work, and left the "problem of deciding which approach to choose as an exercise for the reader". :palm:

The only viable solution I know of, is to go the Felix Cloutier (https://www.felixcloutier.com/x86/) way, and tabulate/spreadsheet the instruction sets for oneself.  It would be so easy to make that into an interactive web page, similar to a spreadsheet, with wiki-like snippet subpages describing each instruction in detail, but with selectable instruction sets and extensions.  A "master" selector with known microcontroller cores would be really helpful, as selecting a microcontroller would select just the architecture and valid extension sets, so just by flipping that selector one could easily compare different cores.

Same is also possible for peripheral units, even with model/manufacturer specific exceptions and tags (that are normally only listed as footnotes or errata), but I bet the different manufacturers wouldn't like to see their MCUs compared so brazenly face-to-face.