hamster_nz,
Many thanks for your workflow suggestions. They pretty much mirror what I was doing in order to find where things were going awry. The major issue is with the net delays. If the path was going through a lot of logic, and the tool had shown me that, then it would have been a case of tweaking the HDL. When there's only one LUT in the path and it still fails, it gets "interesting".

Someone,
I think we are slightly at cross-purposes on what I mean with regard to the tool "helping". Yes, I would like the tool to move fabric regs if that is what it needs to do in order to lower net delay and close timing, but no, I would not want it moving an instantiated DSP48E1 reg out into the fabric on its own. If it is not possible to close timing on an instantiated DSP48E1, then I would expect to have to change the instantiation to disable that reg and create a register in the HDL myself. Thankfully, it doesn't appear to have done that (yet, maybe it is lulling me into a false sense of security, LOL).
Not sure quite what you're after regarding working / reference / source, but the information I stated came from the datasheet and the implementation timing report. The report was telling me that I have a slack of -129ps. The clock was declared at 400MHz, i.e. a 2500ps period. At 75% of 464MHz we are at 348MHz as you state, which is a period of 2873ps, a difference of 373ps. That is significantly more than the 129ps I needed to reach zero slack, so timing would have closed if I had dropped the clock to that speed.
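For anyone who wants to check that arithmetic, here is a minimal Python sketch. It assumes the data path delay stays exactly the same when the clock constraint is relaxed, which is a simplification (the tool would re-place and re-route against the new constraint), and the variable names are just mine for illustration.

```python
# Rough slack estimate when relaxing the clock constraint.
# Assumption: path delay is unchanged, so every extra ps of period
# becomes an extra ps of slack.

def period_ps(freq_mhz: float) -> float:
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1e6 / freq_mhz

declared_mhz = 400.0           # clock as declared in the constraints
derated_mhz = 0.75 * 464.0     # 75% of the 464MHz figure = 348MHz
reported_slack_ps = -129.0     # worst slack from the timing report

# Extra period gained by dropping from 400MHz to 348MHz.
extra_period_ps = period_ps(derated_mhz) - period_ps(declared_mhz)

# Estimated slack at the slower clock, under the assumption above.
estimated_slack_ps = reported_slack_ps + extra_period_ps

print(f"period at {declared_mhz:.0f}MHz: {period_ps(declared_mhz):.1f}ps")
print(f"period at {derated_mhz:.0f}MHz: {period_ps(derated_mhz):.1f}ps")
print(f"extra period: {extra_period_ps:.1f}ps")
print(f"estimated slack: {estimated_slack_ps:.1f}ps")
```

The extra period comes out between 373ps and 374ps, so the estimated slack at 348MHz is comfortably positive.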
The LUTRAM frequency is a bit of an enigma. We both agree the datasheet states 2.5ns period / 400MHz max, but when you look at the timing report it shows a tiny propagation delay from read address in to data out. I guess that shouldn't be surprising since these are LUTs used as RAM and they are fast in that mode. I suspect the limit has more to do with the synchronous write side of the DistRAM than the async read.
With regard to the sg2 / pointy end / hard things being hard - I had failed to appreciate just how "slow" the signal routing is. Let me explain. I am seeing net delays up to 1.5ns. That's the best part of 450mm worth of propagation at c, and even if the propagation velocity was as low as 0.66c it would still be about 300mm.... in other words many, many die-sizes worth, so I just wasn't expecting such long delays. You live and learn.
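The back-of-envelope distance figures can be reproduced with a few lines of Python. This is only a sketch of the free-propagation comparison I was making; it says nothing about what actually dominates routing delay inside the fabric, and the 0.66 velocity factor is just the illustrative number from above.

```python
# Distance a signal would cover in a given time at a fraction of c.
# Purely a sanity check on the "many die-sizes" comparison.

C_MM_PER_NS = 299.792458  # speed of light in vacuum, mm per ns

def distance_mm(delay_ns: float, velocity_factor: float = 1.0) -> float:
    """Distance in mm travelled in delay_ns at velocity_factor * c."""
    return delay_ns * C_MM_PER_NS * velocity_factor

net_delay_ns = 1.5  # worst net delay seen in the timing report

at_c = distance_mm(net_delay_ns)             # a little under 450mm
at_066c = distance_mm(net_delay_ns, 0.66)    # a little under 300mm

print(f"at c:     {at_c:.0f}mm")
print(f"at 0.66c: {at_066c:.0f}mm")
```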

As an exercise I did try the experiment of adding another pipeline stage whose sole purpose in life is to reduce net delays, and I am pleased to report that it worked.... after a little prodding. I needed to add a RAM_STYLE attribute, and it did need one round of post-route PhysOpt to close timing, but it then managed to get the slack up from -129ps to 43ps.

That being said, I suspect that as I add more things, it is going to come back to haunt me again....
Many thanks,
Pat.