Can you feed 1,2,3,4 synchronized into both inputs and check whether 1 arrives with 8 at the multiplier, for example?
I suppose I could try stimulating it as you suggest, but that sounds like it might be prone to errors.
Perhaps there is a way to get the timing analysis to tell me this?
No, not really. Timing analysis is focused on what happens within a clock cycle, not on latency across a number of clock cycles.
IMO you should structure your code so you can't end up with mismatched latency in your pipelines. It is quite easy to do if you can get away from "I must do this in what feels like the most 'source code' efficient way" and move towards "I must do this in a way that minimises errors and adds flexibility, even if it does make the code look verbose".
A bit of 'scaffolding' will make this much easier. Don't plug the output of one computation onto the inputs of the next, but set up a framework and plug the calculation components onto that, then let the tools prune away unused resources. The end result will be equally efficient, or maybe even better as the optimiser may have more freedom to shuffle registers around.
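One possible reading of the 'scaffolding' idea, sketched as a small Python behavioural model rather than HDL: every signal rides in a per-stage frame that advances one stage per clock, and calculations are attached to a stage instead of being chained output-to-input. The `Scaffold` class, the signal names `a`, `b`, `c`, `prod`, `result`, and the three-stage depth are all hypothetical, chosen only for illustration.

```python
class Scaffold:
    """Hypothetical pipeline framework: one frame (dict of signals) per stage."""

    def __init__(self, n_stages):
        self.frames = [None] * n_stages  # register contents after each stage
        self.ops = {}                    # stage index -> attached calculations

    def attach(self, stage, fn):
        """Plug a calculation onto a stage; fn(frame) returns new signals."""
        self.ops.setdefault(stage, []).append(fn)

    def _apply(self, stage, frame):
        if frame is None:                # pipeline slot not filled yet
            return None
        frame = dict(frame)              # every signal moves on, used or not
        for fn in self.ops.get(stage, []):
            frame.update(fn(frame))
        return frame

    def clock(self, inputs):
        """One clock edge: each stage registers what the previous stage held."""
        prev = [dict(inputs)] + self.frames[:-1]
        self.frames = [self._apply(i, f) for i, f in enumerate(prev)]
        return self.frames[-1]


# Multiply at stage 0, add at stage 2. 'c' is never delayed by hand:
# it travels in the same frame as 'prod', so the two cannot drift apart.
sc = Scaffold(3)
sc.attach(0, lambda f: {"prod": f["a"] * f["b"]})
sc.attach(2, lambda f: {"result": f["prod"] + f["c"]})

stimulus = [(1, 2, 3), (4, 5, 6), (0, 0, 0), (0, 0, 0)]
results = []
for a, b, c in stimulus:
    out = sc.clock({"a": a, "b": b, "c": c})
    results.append(out["result"] if out else None)

print(results)  # [None, None, 5, 26]
```

In this model, moving the multiply to a different stage changes nothing about alignment, because operands are always read from the same frame. In real hardware the synthesis tools would be expected to remove the frame registers that nothing reads.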
In both cases debugging involves simulating and sending an impulse down the pipeline, and checking that the results come out in the correct alignment...
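As a rough illustration of the impulse test, here is a minimal Python model (not HDL, and not anyone's actual design) that treats each pipeline branch as a chain of registers, sends a single 1 followed by zeros down each branch, and compares the cycle on which it emerges. `DelayLine`, `impulse_arrival`, `skew` and the depths used are hypothetical names and values.

```python
from collections import deque


class DelayLine:
    """Chain of `depth` registers; each step() call models one clock edge."""

    def __init__(self, depth):
        # deque with maxlen drops the oldest entry when a new one is appended
        self.regs = deque([0] * depth, maxlen=depth) if depth > 0 else None

    def step(self, value):
        if self.regs is None:      # zero-depth branch is just a wire
            return value
        out = self.regs[0]         # oldest value leaves the chain
        self.regs.append(value)    # new value enters the chain
        return out


def impulse_arrival(depth, cycles=32):
    """Return the cycle on which an impulse injected at cycle 0 comes out."""
    line = DelayLine(depth)
    for cycle in range(cycles):
        stimulus = 1 if cycle == 0 else 0
        if line.step(stimulus) == 1:
            return cycle
    return None                    # impulse never emerged within `cycles`


def skew(depth_a, depth_b):
    """Difference in arrival cycle between two branches; 0 means aligned."""
    return impulse_arrival(depth_a) - impulse_arrival(depth_b)


print(skew(3, 3))  # 0  -> both operands reach the multiplier together
print(skew(3, 5))  # -2 -> branch A is two cycles early and needs padding
```

In a real testbench the same check would be made on the simulated design itself, by noting the cycle on which the impulse appears on each multiplier input.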