Yep, this is why in my core I've split decoding and register reading into separate stages. Before that, it was on the critical path. Now execute/alu64 is on the critical path, there is nothing I can do about it, and I don't want to make it multicycle. That is part of the reason why I want to try a Tomasulo out-of-order approach: this way I can make shifts (which are the critical path inside the ALU) multicycle without affecting other ALU operations. I've also added full bypassing, and as I expected, it did not cause any timing problems.
You can of course easily split the six layers of a logarithmic 64 bit shifter into two pipeline stages with three shift layers in each stage.
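As a rough illustration of that split, here is a behavioural C model of a left shifter (not RTL, and not taken from anyone's actual core). The function names are made up for this sketch, and the value passed from the first stage to the second stands in for whatever pipeline register the real design would use.

```c
#include <stdint.h>

/* One layer of a logarithmic left shifter: shift by 2^layer
 * if the corresponding bit of the shift amount is set. */
static uint64_t shift_layer(uint64_t v, unsigned shamt, unsigned layer) {
    return ((shamt >> layer) & 1u) ? (v << (1u << layer)) : v;
}

/* Stage 1: layers 0..2, i.e. conditional shifts by 1, 2 and 4. */
uint64_t sll_stage1(uint64_t v, unsigned shamt) {
    for (unsigned layer = 0; layer < 3; layer++)
        v = shift_layer(v, shamt, layer);
    return v;
}

/* Stage 2: layers 3..5, i.e. conditional shifts by 8, 16 and 32. */
uint64_t sll_stage2(uint64_t v, unsigned shamt) {
    for (unsigned layer = 3; layer < 6; layer++)
        v = shift_layer(v, shamt, layer);
    return v;
}

/* Full 64-bit shift: in hardware, 'mid' (and shamt) would be latched
 * in a pipeline register between the two stages. */
uint64_t sll64(uint64_t v, unsigned shamt) {
    shamt &= 63u;                         /* RV64 uses the low 6 bits */
    uint64_t mid = sll_stage1(v, shamt);
    return sll_stage2(mid, shamt);
}
```

Each stage then has three mux layers of depth instead of six, which is the point of the split.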
The problem is that this almost certainly isn't going to let you double the clock frequency, as something else will become the critical path long before that. That means that while every other instruction will now go a little faster, shifts will now be slower. If the clock speed improvement is small (under 5%?) it might not be a win overall. Shifts are pretty common in some types of code, and with the base ISA (no B extension) there is a regrettable number of back-to-back shifts, for example to zero-extend a 32 bit unsigned value to 64 bits whenever someone has foolishly used an unsigned int (32 bits) as a loop index or similar and then uses it to index an array. People who use size_t or similar for their loop indexes are fine. As are people who use "int".
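To make the back-to-back shifts concrete, here is a small C model of what the base ISA has to do. It assumes the usual RV64 convention that a 32 bit value, even an unsigned one, sits sign-extended in its 64 bit register; the function names are invented for illustration, and the instruction sequences in the comments are the typical ones rather than the output of any particular compiler.

```c
#include <stdint.h>

/* Zero-extend the low 32 bits of a register using only shifts.
 * Models the pair:  slli t, x, 32 ; srli t, t, 32 */
uint64_t zext32_via_shifts(uint64_t reg) {
    uint64_t t = reg << 32;   /* slli: push the upper 32 bits out */
    return t >> 32;           /* srli: bring zeros in from the top */
}

/* Address of a[i] for 8-byte elements when i is an unsigned 32 bit
 * index held in a 64 bit register. The scaling by 8 can be folded
 * into the second shift, but it is still two dependent shifts
 * followed by an add. */
uint64_t elem_addr_base_isa(uint64_t base, uint64_t index_reg) {
    uint64_t t = index_reg << 32;   /* slli t, i, 32 */
    t >>= 29;                       /* srli t, t, 29 (zero-extend, times 8) */
    return base + t;                /* add  t, base, t */
}
```

If each of those shifts takes two cycles instead of one, every such array access pays for it twice.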
CoreMark is an unfortunate offender here. They have gone out of their way to typedef critical variables as "unsigned int" because this produces better code than "int" on 64 bit ARM CPUs. But it de-optimises 64 bit RISC-V. For some time, RISC-V people were changing the typedef, but the benchmark owner ruled that this is not permitted.
The RISC-V B extension adds some instructions which do 64 bit computations after implicitly zero-extending one of the operands, which eliminates this problem. It also adds instructions to sign-extend and zero-extend 8, 16, and 32 bit quantities to 64 bits with a single instruction instead of a pair of shifts.
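For comparison, here is a C model of two such instructions, add.uw and sh3add.uw from the Zba part of the B extension, written from my reading of the spec. The C function names are just labels for this sketch.

```c
#include <stdint.h>

/* add.uw rd, rs1, rs2 : rd = rs2 + zero_extend(rs1[31:0]) */
uint64_t add_uw(uint64_t rs1, uint64_t rs2) {
    return rs2 + (uint64_t)(uint32_t)rs1;
}

/* sh3add.uw rd, rs1, rs2 : rd = rs2 + (zero_extend(rs1[31:0]) << 3)
 * i.e. the address of an 8-byte array element in one instruction. */
uint64_t sh3add_uw(uint64_t rs1, uint64_t rs2) {
    return rs2 + ((uint64_t)(uint32_t)rs1 << 3);
}
```

With rs2 set to zero, add.uw is simply a 32 to 64 bit zero-extend, so the shift pair disappears entirely and the shifter latency no longer matters for this pattern.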
But anyway -- it would be very interesting to know what becomes the critical path after you split shifts into 2 stages, and how much frequency increase you can then get.