I believe x86 SIMD uses separate registers.
Correct. In 64-bit mode, SSE2 has sixteen 128-bit vector registers, XMM0-XMM15 (only XMM0-XMM7 in 32-bit mode). Each can be treated as two double-precision floating-point numbers, four single-precision floating-point numbers, two 64-bit integers, four 32-bit integers, eight 16-bit integers, or sixteen 8-bit integers, signed or unsigned, depending on the instruction. These are distinct from the 387 floating-point registers. AVX extends them to the 256-bit YMM0-YMM15, with each XMM register remaining accessible as the low half of the corresponding YMM register; AVX2 then extends the integer operations to the full 256 bits. AVX-512 extends them again to the 512-bit ZMM0-ZMM31, not just widening the registers but also doubling their number.
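To make the "depending on the instruction" part concrete, here is a minimal portable C sketch. It is not real SSE2 code: the vec128 union and the two add functions are hypothetical names made up for this post, and plain loops stand in for what the hardware does in one instruction. The point is only that the same 128 bits give different results depending on how the instruction slices them into lanes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: one 128-bit value, viewable in any SSE2 lane layout. */
typedef union {
    double   f64[2];   /* two double-precision floats   */
    float    f32[4];   /* four single-precision floats  */
    uint64_t u64[2];   /* two 64-bit integers           */
    uint32_t u32[4];   /* four 32-bit integers          */
    uint16_t u16[8];   /* eight 16-bit integers         */
    uint8_t  u8[16];   /* sixteen 8-bit integers        */
} vec128;

static_assert(sizeof(vec128) == 16, "vec128 must be exactly 128 bits");

/* Lane-wise wrapping add of four 32-bit integers (the job PADDD does). */
vec128 add_u32x4(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 4; i++)
        r.u32[i] = a.u32[i] + b.u32[i];
    return r;
}

/* Lane-wise wrapping add of sixteen 8-bit integers (the job PADDB does). */
vec128 add_u8x16(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 16; i++)
        r.u8[i] = (uint8_t)(a.u8[i] + b.u8[i]);
    return r;
}
```

With every 32-bit lane of a set to 0xFF and every 32-bit lane of b set to 1, add_u32x4 yields 0x100 in each lane, while add_u8x16 yields 0 in each 32-bit lane, because the 0xFF byte wraps around and no carry crosses a byte boundary.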
The vector registers are completely separate from the normal AMD64 (x86-64) general-purpose registers (rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8, r9, r10, r11, r12, r13, r14, and r15), and use a completely different set of instructions.
Single-precision floating-point vectors are widely used in image and geometry processing (including wavelet transforms and such, unless done using a GPU), and also in heavy sound analysis (single-precision FFT/DFT/Hartley transforms and such). Double-precision floating-point vectors are heavily used in computational physics, both ab initio (quantum mechanics; VASP et cetera) and classical (potentials; LAMMPS, GROMACS et cetera). Using the bitwise operations on the floating-point values is also surprisingly common (absolute values, min/max, masking/conditionals). The major use for the various integer operations is speeding up cryptographic operations, which nowadays are absolutely ubiquitous; not just in securing socket communications, but in internal kernel operations (like ensuring unpredictability of kernel random number sources).
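As a rough illustration of the bitwise tricks on floating-point values, here is a scalar C sketch, assuming 32-bit IEEE 754 floats. The function names abs_by_mask and select_by_mask are hypothetical. The vector instructions apply the same AND/OR masks to every lane of a register at once; this sketch does it for a single value only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t),
              "this sketch assumes a 32-bit IEEE 754 float");

/* Absolute value: AND away the sign bit (bit 31), no comparison needed. */
float abs_by_mask(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);      /* reinterpret the float as raw bits */
    bits &= UINT32_C(0x7FFFFFFF);        /* clear the sign bit */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* Branch-free conditional: mask is all ones (pick a) or all zeros (pick b),
   which is the form vector compare instructions produce per lane. */
float select_by_mask(uint32_t mask, float a, float b)
{
    uint32_t ua, ub, ur;
    float r;
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    ur = (ua & mask) | (ub & ~mask);
    memcpy(&r, &ur, sizeof r);
    return r;
}
```

The second function is why masking matters for vector code: a per-lane "if" cannot branch, so both alternatives are computed and the mask picks one of them per lane.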
As to the underlying microcode and hardware implementation, it looks like Intel and AMD implementations do differ quite a bit. Mathematically their results are identical, but how different operations pipeline, and how efficient vector-intensive operations are, varies a lot between processor families.
For accelerating cryptographic operations, double-width unsigned integer multiplication is an absolute must. (Meaning, you really need a multiplication operation C = B × A where C is a pair of registers, or double the width of A and B; this is usually called a widening multiplication. Apologies for the poor terminology; my English is failing me today.) The size of the unsigned integers we deal with will only increase; right now ordinary workstations do a surprising amount of work on 2048-bit and higher unsigned integers. So it is not just the size of the registers that matters; the basic operations (addition, subtraction, multiplication, and, or, xor, not) must also be fast and efficient enough to warrant their use.
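To show why the double-width result matters, here is a portable C sketch of C = B × A with C as a pair of words, plus the inner step of a big-integer multiplication built on top of it. The names u128_pair, mul_wide_u64 and mul_limbs_by_word are hypothetical, and this is only one straightforward way to write it, not how any particular crypto library does it. Hardware with a native widening multiply does the first function in a single instruction, which is exactly why it is so valuable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: the 128-bit product as a (hi, lo) pair. */
typedef struct { uint64_t hi, lo; } u128_pair;

/* Full 64x64 -> 128-bit unsigned product, built from four
   32x32 -> 64-bit partial products. */
u128_pair mul_wide_u64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint64_t p1 = a_lo * b_hi;   /* weight 2^32 */
    uint64_t p2 = a_hi * b_lo;   /* weight 2^32 */
    uint64_t p3 = a_hi * b_hi;   /* weight 2^64 */

    /* Middle column: three values below 2^32 each, so no 64-bit overflow. */
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFu) + (p2 & 0xFFFFFFFFu);

    u128_pair r;
    r.lo = (mid << 32) | (p0 & 0xFFFFFFFFu);
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}

/* out[0..n] = a[0..n-1] * b, with limbs stored least significant first.
   out must have room for n + 1 limbs. */
void mul_limbs_by_word(uint64_t *out, const uint64_t *a, size_t n, uint64_t b)
{
    assert(out != NULL && a != NULL);
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        u128_pair p = mul_wide_u64(a[i], b);
        uint64_t lo = p.lo + carry;
        /* p.hi is at most 2^64 - 2, so adding the carry bit cannot wrap. */
        carry = p.hi + (lo < p.lo);
        out[i] = lo;
    }
    out[n] = carry;
}
```

A 2048-bit number is 32 such 64-bit limbs, so a schoolbook 2048-bit multiplication calls the widening multiply 32 × 32 = 1024 times. If the high half of each product were not available, every one of those steps would need the four-partial-product dance above.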
(It turns out that at least some Intel implementations of AVX2 and AVX-512 are not really worth the extra cost when mostly using double-precision floating-point vectors. Bummer. But that is the reason CERN and others doing heavy physics computations really do not want to be using the very newest hardware, but rather hardware chosen based on the amount of computation achieved per euro. Theoretical gains look nice on paper, but practice trumps theory. That said, a lot of the existing simulation software and surrounding services (the CERN data is structured, not just "flat files") is horribly inefficiently designed, and a lot more could be done to fix that... but don't get me started on that. And yes, I have been an admin of an HPC cluster used to munch on terabytes of CERN data. I even once built an auto-evaluation Linux USB stick with actual simulations, so that vendors could measure the performance of their offerings for a new cluster acquisition.)