I'd expect 64-bit-only CPUs to become more and more common.
My guess is that five years from now, everything will be 64-bit only. Maintaining multiple instruction sets (in hardware and software alike) is just a waste. IMHO everything was just fine in the 32-bit world, but you can't stop marketing.
32 bits is certainly enough for many or most embedded purposes, or even 8 bits (with 16-bit addresses). The issue with individually packaged microcontrollers (as used by hobbyists and small-scale manufacturing) is that as process sizes shrink and/or the number of I/O pins increases, the size of the silicon die becomes determined purely by having enough perimeter to accommodate all the I/O wires. If you use an 8-bit core, it can end up as a tiny island in the middle of the chip surrounded by an ocean of unused (but still paid-for) silicon. The incremental cost of putting a 32-bit or even 64-bit CPU core in there can be exactly zero, and they do have benefits in ease of programming and work done per instruction, and even work done per joule.
and here we go, well on the way to 64-bit. I don't know whether they are going to switch to 128-bit some day, but I'm sure it won't be in my lifetime. So, 64-bit it is!
RISC-V has been designed from the start to accommodate an eventual move to 128 bits, with instruction encodings reserved for the necessary extra instructions. It turns out there are some people who actually would like to have this bigger address space very soon. It's certainly easy enough to build if someone does.
I also don't expect regular desktop or handheld machines to outgrow 64 bit addresses in my lifetime.
If you consider that home computers graduated from 8- or 12-bit to 16-bit addressing around 1976, to 32-bit around 1984 (Mac), 1985 (Atari ST, Amiga), or 1987 (Compaq 386), and to 64-bit around 2002 (Athlon64) to 2006 (Core 2 Duo), we have a pretty steady increase of around 1.6 bits per year, which implies home computers will start to switch to 128-bit addresses around 2046.
My heart says it's slowing down now and will take longer, but who knows?
BTW: Does RISC-V have SIMD support as wide as AArch64 or x64?
The currently ratified RISC-V instruction set has no SIMD support at all. There are two complementary standardization efforts underway:
1) the "P" packed SIMD / DSP extension. This was penciled in from the start, but when work actually commenced, the Chinese company Andes stepped up and said something along the lines of: "We have an existing SIMD/DSP instruction set from our NDS32 ISA, industry-proven over a number of years, which we have already ported into the new RISC-V cores we are shipping to customers, and we are prepared to donate it to the RISC-V Foundation."
I believe the Working Group has made some changes to it, but I think not a lot. I haven't paid close attention.
This does SIMD in the 32 integer registers. Maybe that's another reason some people might want 128-bit registers, even though 32 bits might be enough for their memory-addressing needs.
2) the "V" Vector extension. This is based on work at UCB through the 2000s and into the 2010s, inspired by the original Cray vector supercomputers. ARM has recently specified the very similar (but more limited) SVE and MVE (aka Helium) vector extensions, but they haven't yet released any cores implementing them (not sure about MVE, but certainly not SVE). Fujitsu have a prototype chip for supercomputers implementing SVE.
The V extension provides a new register file with 32 registers of size anywhere from 32 bits to 2^32 bits, with the same program binary working on vector registers of any size. I'm sure no one will ever build physical registers with 2^32 bits, but it's possible someone will make some kind of streaming implementation that appears to the program to have them. ARM SVE also has 32 registers, but they are architecturally limited to the range 128 bits to 4096 bits. I think we'll likely see some RISC-V implementations with vector registers of 16k or 64k bits in the fairly near future.
It's likely many early implementations will use 512-bit registers, matching a common 64-byte cache line size, which means they can make near-optimal use of the memory subsystem.
Remember Ken Batcher's ~1970 quip: "A supercomputer is a device for turning compute-bound problems into IO-bound problems". The same is true of properly implemented vector processing units, except the IO in question is memory bandwidth.