tchiwam,
Agreed, moving as much as you can out of the path helps. The issue I had was that I was down to only one item in the path, and I couldn't move it out. 70% of the delay was coming from the net, not from the logic, so the usual approach of moving things around in the pipeline wasn't going to work; what it needed was a shorter net delay.
pbernardi,
You are of course correct that a faster speed grade might get timing closure where the slower device cannot, but this misses the point that the slower device *should* have been able to get timing closure. Remember that the only reason it was failing was due to net delays... the logic was plenty fast enough.
With regard to the BRAM suggestion, yes you can MUX wider data into the DSP48E1. That wasn't going to work for my application, which is more latency-oriented than throughput-oriented (which is why adding a pipeline stage is a PITA, since everything else then has to compensate for that extra latency). For raw throughput it could certainly work, though only up to 464 MHz, since that's the max pipelined throughput of a DSP48E1 in the SG1 Spartan7. [You can of course run the DSP48E1 as 2 lots of 24-bit-wide data, or 4 lots of 12-bit-wide data, for a max of 1856M data-elements per second, per DSP48E1].
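For anyone checking the arithmetic behind that 1856M figure, here is a minimal sketch. It only multiplies the 464 MHz number quoted above by the number of lanes; the function name and the lane table are my own illustration, not anything from the Xilinx tools.

```python
# Throughput arithmetic for the figures quoted in the post.
# F_MAX_HZ is the 464 MHz pipelined figure from the post; the lane
# widths (48/24/12 bits) are the one-, two- and four-lane splits mentioned.
F_MAX_HZ = 464_000_000

# Hypothetical helper: lanes -> bit width of each lane
LANE_WIDTH_BITS = {1: 48, 2: 24, 4: 12}


def elements_per_second(lanes: int) -> int:
    """Data elements processed per second by one DSP48E1 at F_MAX_HZ."""
    if lanes not in LANE_WIDTH_BITS:
        raise ValueError("lanes must be 1, 2 or 4")
    return F_MAX_HZ * lanes


for lanes, width in LANE_WIDTH_BITS.items():
    print(f"{lanes} x {width}-bit: {elements_per_second(lanes) / 1e6:.0f}M elements/s")
```

The per-element width shrinks as the lane count grows, so this is only a win if your data fits in 12 or 24 bits.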
All,
In order to avoid confusion, and because I did actually have some failures which had *no* logic in the path at all, I'll concentrate on that case and not muddy the waters with having logic in the path:
The major lesson I learnt from this exercise is that the timing report does not make it obvious that a net is not necessarily monolithic, i.e. that it is made up of multiple segments, each of which adds its own delay, and that one can leverage that fact to gain timing closure by introducing extra flops somewhere along the path. Again, just to be clear, we are talking about a direct connection from point A to point F, which is reported as a single total net delay between two sequential elements with *no logic* in between them (it's just a wire).
In reality the net takes a path ABCDEF, representing multiple segments of the same net (*no* logic!). By breaking this path at C/D you get two sub-paths, ABC and DEF. A flop between C and D solves the timing closure, since an edge launched at A arrives at C in time, and an edge launched at D arrives at F in time. It does introduce a pipeline delay, and you may have to compensate for that, BUT it will get timing closure where previously you couldn't.
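To make the ABCDEF idea concrete, here is a rough sketch of the reasoning as a toy model. The segment delays and the clock period are made-up numbers, not values from my design or from any timing report, and the model ignores the inserted flop's own setup and clock-to-out times as well as clock skew. It simply treats the net as a list of segment delays and looks for the node where a flop splits it into the two most balanced halves.

```python
from typing import List, Tuple


def best_split(segment_delays_ns: List[float]) -> Tuple[int, float]:
    """Find the node at which to insert a flop.

    Returns (i, worst), where the flop sits after the first i segments and
    `worst` is the larger of the two resulting sub-path delays in ns.
    """
    best_i, best_worst = -1, float("inf")
    for i in range(1, len(segment_delays_ns)):
        first_half = sum(segment_delays_ns[:i])
        second_half = sum(segment_delays_ns[i:])
        worst = max(first_half, second_half)
        if worst < best_worst:
            best_i, best_worst = i, worst
    return best_i, best_worst


# Hypothetical delays for segments A-B, B-C, C-D, D-E, E-F (ns)
segments = [0.5, 0.75, 0.25, 0.75, 0.5]
period_ns = 2.0  # hypothetical clock period

total = sum(segments)
split, worst = best_split(segments)

print(f"unbroken net: {total} ns, meets timing: {total <= period_ns}")
print(f"flop after segment {split}: worst half {worst} ns, "
      f"meets timing: {worst <= period_ns}")
```

With these invented numbers the unbroken wire takes 2.75 ns and fails a 2.0 ns period, while either half after the split is at most 1.5 ns and passes. The real tools decide the routing, so in practice you place the flop and let the router work out the segments; the model is only there to show why breaking a pure wire can help.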
Separately, even when doing this, it was necessary to get Vivado to do post-route PhysOpt, but once that was done, it worked.

Cheers,
Pat.