Whether you register the ALU output is up to you. It comes down to timing and whether speed matters. If you intend to fetch an operand, do some ALU work, send the output through a bus buffer, and write it into memory all in one clock cycle, it had better be a really long cycle.
The other side of the trade-off is complexity and pipelining. If you spend a clock registering the ALU output, each cycle can be shorter, but there are more of them. So the question becomes: what else can you be doing during the extra clock cycle that writes back to memory?
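To put rough numbers on that trade-off, here is a minimal sketch. The delay values and the function names are made up for illustration, and it ignores register setup and clock-to-output times, so treat it as a way of thinking rather than a real timing analysis.

```python
# Hypothetical propagation delays in nanoseconds; purely illustrative values.
DELAYS_NS = {
    "operand_fetch": 10.0,
    "alu": 15.0,
    "bus_buffer": 5.0,
    "memory_write": 20.0,
}


def unregistered_period(delays):
    """Everything happens in one cycle, so all the delays add up."""
    return sum(delays.values())


def registered_period(delays, first_stage=("operand_fetch", "alu")):
    """A register after the ALU splits the path in two.

    The clock only has to be long enough for the slower of the two halves.
    """
    first = sum(delays[name] for name in first_stage)
    second = sum(d for name, d in delays.items() if name not in first_stage)
    return max(first, second)


def time_per_instruction(period_ns, cycles, overlapped=False):
    """Average time per instruction.

    Without overlap an instruction occupies `cycles` clocks. With ideal
    overlap (the next instruction starts every clock), one finishes per clock.
    """
    return period_ns * (1 if overlapped else cycles)


long_cycle = unregistered_period(DELAYS_NS)
short_cycle = registered_period(DELAYS_NS)

print("one long cycle:        ", time_per_instruction(long_cycle, 1), "ns")
print("two short, no overlap: ", time_per_instruction(short_cycle, 2), "ns")
print("two short, overlapped: ", time_per_instruction(short_cycle, 2, overlapped=True), "ns")
```

With these particular numbers, registering the ALU output gains nothing unless the extra cycle overlaps with other work, which is exactly why the next question is what else the machine can be doing.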
Which brings up von Neumann versus Harvard architecture. In the von Neumann approach, instructions and data share a single memory and bus, so you can't fetch the next instruction while you are writing back a result. In the Harvard approach, program memory is separate and sits on its own bus, which allows the fetch to overlap with the result write-back.
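A toy cycle count shows the difference. This is a sketch under simplifying assumptions of my own: every instruction needs one fetch, some also write a result to memory, each memory access takes one clock, and nothing else stalls. The function name `cycles_needed` is hypothetical.

```python
def cycles_needed(writes_back, harvard):
    """Count clocks for a sequence of instructions.

    writes_back: list of bools, True if that instruction writes a result
                 to data memory.
    harvard:     True for separate program and data buses, False for one
                 shared bus.
    """
    fetches = len(writes_back)
    if harvard:
        # The write-back of instruction i shares a clock with the fetch of
        # instruction i+1. Only a write-back by the very last instruction
        # needs a clock of its own, since nothing is left to fetch.
        drain = 1 if writes_back and writes_back[-1] else 0
        return fetches + drain
    # Shared bus: every fetch and every write-back needs its own bus cycle.
    return fetches + sum(writes_back)


program = [True, False, True, True]  # which instructions store a result
print("von Neumann:", cycles_needed(program, harvard=False), "cycles")
print("Harvard:    ", cycles_needed(program, harvard=True), "cycles")
```

The more often the program writes to memory, the bigger the gap gets in this model.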
And that brings up RISC architecture. ALU operands are never read from or written to memory directly: all ALU operations occur between registers, and only a few instructions load or store registers. We then keep two images (copies) of the register bank so that any register can be selected for either side of the ALU in the same cycle. When we write back, we write both images simultaneously so they stay identical. In any event, it is common to use a 5-stage pipeline for RISC instructions. See "Digital Design and Computer Architecture: ARM Edition" by Harris and Harris.
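The two-image register bank can be modeled in a few lines. This is a behavioral sketch only, not hardware: the class name, the register count, and the 32-bit width are assumptions for illustration.

```python
class DualImageRegisterFile:
    """Two identical copies of the register bank, one per ALU input."""

    def __init__(self, count=16):
        self.image_a = [0] * count  # feeds the A side of the ALU
        self.image_b = [0] * count  # feeds the B side of the ALU

    def read(self, reg_a, reg_b):
        """Read two registers at once, one from each image."""
        return self.image_a[reg_a], self.image_b[reg_b]

    def write(self, reg, value):
        """Write both images together so they never disagree."""
        self.image_a[reg] = value
        self.image_b[reg] = value


def execute_add(regs, dest, src_a, src_b, width=32):
    """Register-to-register add: operands never come from memory."""
    a, b = regs.read(src_a, src_b)
    regs.write(dest, (a + b) & ((1 << width) - 1))  # wrap to the register width


regs = DualImageRegisterFile()
regs.write(1, 5)
regs.write(2, 7)
execute_add(regs, dest=3, src_a=1, src_b=2)
print(regs.read(3, 3))
```

In an FPGA this usually maps to two simple memories with one read port each that share the write address and data, which is why the duplication is cheap.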
One approach to studying timing is to use Excel spreadsheets to explore the timing of the various instructions. Really, the design of the datapath is the bulk of the work. "Microprocessor Design Using Verilog HDL", written by one of the original Z80 designers, goes through this process for the Z80 CPU.
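If a spreadsheet isn't handy, the same kind of grid can be generated with a short script. The sketch below lays out an ideal five-stage RISC pipeline with one row per instruction and one column per clock. The stage names IF, ID, EX, MEM, WB are the usual textbook ones, the instruction mnemonics are arbitrary labels, and hazards and stalls are deliberately ignored.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # conventional textbook stage names


def pipeline_grid(instructions):
    """Return (name, row) pairs; row[c] is the stage occupied in clock c."""
    total_cycles = len(instructions) + len(STAGES) - 1 if instructions else 0
    grid = []
    for index, name in enumerate(instructions):
        row = [""] * total_cycles
        # Instruction `index` enters the pipeline one clock after the previous one.
        for offset, stage in enumerate(STAGES):
            row[index + offset] = stage
        grid.append((name, row))
    return grid


def format_grid(grid):
    """Render the grid as fixed-width text, like a small spreadsheet."""
    if not grid:
        return ""
    cycles = len(grid[0][1])
    lines = ["instr".ljust(8) + "".join(str(c + 1).ljust(5) for c in range(cycles))]
    for name, row in grid:
        lines.append(name.ljust(8) + "".join(cell.ljust(5) for cell in row))
    return "\n".join(lines)


grid = pipeline_grid(["add", "sub", "ldr"])
print(format_grid(grid))
```

Reading down a column shows what every part of the datapath is doing in that clock, which is where resource conflicts tend to show up.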
I hope you'll keep asking your questions on this site, but in the Micros and FPGAs forum. There are some very knowledgeable folks who haunt that forum.