If there are only two, they will just say theirs is right and the other's is wrong.
I discovered the benefit of testing simultaneously on multiple platforms during a 500,000-line port from VMS to Unix. That code had fewer than a dozen bugs reported in the first release, and the count went down from there. Because we built test cases and ran them on multiple platforms as we wrote the code, we quickly learned what things to avoid. That code was in service for 12-16 years and completely unsupported for 4-6 years. They only pulled the plug when it simply became obsolete.
I wish it was possible to write generic VHDL (or Verilog; I prefer and use the former daily) that would let me be vendor-agnostic. But the reality is that you can't. And believe me, I've tried, even going so far as to use VHDL configurations and generates to swap out vendor-specific blocks. You end up with spaghetti, the kind that's been sitting in the drainer for a few hours because your wife made dinner and you were still at work, and now it's a blob of paste in the sink.
The good news is that inferring standard things like RAMs and ROMs is portable.
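As a rough illustration of the kind of RAM description that synthesis tools generally recognize, here is a minimal sketch of a simple dual-port RAM with a registered read. The entity name, port names, and generic defaults are made up for the example. One caveat: what you read back when you write and read the same address in the same cycle can differ between vendors' block RAMs, so don't rely on it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: one write port, one read port, one clock.
entity inferred_ram is
  generic (
    ADDR_WIDTH : positive := 8;
    DATA_WIDTH : positive := 16
  );
  port (
    clk   : in  std_logic;
    we    : in  std_logic;
    waddr : in  std_logic_vector(ADDR_WIDTH - 1 downto 0);
    wdata : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    raddr : in  std_logic_vector(ADDR_WIDTH - 1 downto 0);
    rdata : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity inferred_ram;

architecture rtl of inferred_ram is
  type ram_t is array (0 to 2**ADDR_WIDTH - 1)
    of std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal ram : ram_t;  -- no initializer, no reset: just storage
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(waddr))) <= wdata;
      end if;
      -- Registered (synchronous) read; block RAMs typically need this.
      rdata <= ram(to_integer(unsigned(raddr)));
    end if;
  end process;
end architecture rtl;
```

Note there is no reset on the memory array itself; adding one usually keeps the tool from mapping the array to a block RAM.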
Things as simple as input DDR blocks aren't the same from vendor to vendor. Some families have input and output serializers, and there are all sorts of specific clocking requirements that make porting difficult. Clock resources are all over the place: some have PLLs, some have DLLs, some have both, some have just delay elements. Some families have input delays on all pins, some only on clock pins.
Hard blocks which require instantiation are not portable. Altera's gigabit serializers don't work the same way as Xilinx's. Lattice has user-accessible flash in the MachXO parts and Xilinx has the same in Spartan-3AN, but with completely different access mechanisms, so don't pretend you can abstract that. Memory interfaces (DDR3 and such) are all different, with wizards for configuration and setting the zillion parameters each one seems to have. And then there is the interface to the interface. What is provided? Wishbone? AXI? PLB? Something else?
Even the simple stuff isn't portable. Here's an example.
I spent years doing Xilinx designs, and in the Xilinx world, you can initialize your flip-flops as such:
signal foo : std_logic_vector(7 downto 0) := X"AB";
Immediately after configuration completes, the eight flip-flops that form the vector foo are preset with the value AB. This means, among other things, that an explicit reset is not necessary, as that's done as part of the configuration process (which happens at power-up or whenever otherwise forced). Certainly, a logic reset can be used as necessary, which leads to ...
A second thing about Xilinx is that they tell you that if you really need a (global) reset, you should always use synchronous resets, never asynchronous resets. The reason? The reset net's prop delay is "excessive" and to make sure that all flip-flops come out of reset at the same time, you should use the synchronous reset. The sync reset is synchronous to the clock and the timing analyzer knows how to properly determine whether the routing for it meets timing. (It's basically flip-flop to flip-flop like any other synchronous path.) And the good news is that the flip-flops can be configured so their reset inputs are synchronous or asynchronous. The synthesis tool does this automatically, and it doesn't use any extra resources. That is, the flop's D and CE inputs aren't involved at all with reset.
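For reference, the two reset styles look like this in VHDL. This is only a sketch: the entity and signal names are invented, and the X"AB" reset value is borrowed from the initializer example above. The only difference is whether the reset test sits inside or outside the clock-edge test.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity showing both reset styles side by side.
entity reset_styles is
  port (
    clk     : in  std_logic;
    rst     : in  std_logic;  -- active-high reset
    d       : in  std_logic_vector(7 downto 0);
    q_sync  : out std_logic_vector(7 downto 0);
    q_async : out std_logic_vector(7 downto 0)
  );
end entity reset_styles;

architecture rtl of reset_styles is
begin
  -- Synchronous reset: rst is sampled only on the rising clock edge,
  -- so it is timed like any other flip-flop to flip-flop path.
  sync_reset : process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        q_sync <= X"AB";
      else
        q_sync <= d;
      end if;
    end if;
  end process sync_reset;

  -- Asynchronous reset: rst is in the sensitivity list and takes
  -- effect immediately, regardless of the clock.
  async_reset : process (clk, rst)
  begin
    if rst = '1' then
      q_async <= X"AB";
    elsif rising_edge(clk) then
      q_async <= d;
    end if;
  end process async_reset;
end architecture rtl;
```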
We started to use Microsemi FPGAs, the ProASIC-3E parts in particular.
ProASIC-3 Lesson 1. The VHDL initializers (to set or reset flip-flops at startup) are ignored; the fabric has no way to implement them. So you must reset all flip-flops.
Lesson 2. An external power-on or other explicit reset is required, as the states of each flip-flop at power-up are unknown, because there is no initialization from configuration memory and there is no GSR.
Lesson 3. The flip-flops support an asynchronous reset or preset only. They do not support a synchronous reset. To implement the synchronous reset, the synthesizer builds a mux with one input at the reset (or preset) value, selected by the reset signal, and that's combined with all of the other logic that drives the D and CE inputs. This makes your resource use explode. Yes, a lot of logic uses what appear to be synchronous resets, say, counter clears and suchlike, but that's not global to every flip-flop in the design, and that's usually coded in addition to the global sync reset.
Lesson 4. The fabric doesn't require you to use a special reset input pin. Pick any pin that is convenient. But it is smart enough to recognize a reset as a large fan-out signal and it will put it on a low-skew global net. These nets are commonly used for clocks, but (very much unlike modern Xilinx parts) are accessible from the fabric, so any signal can drive them and they can connect to any logic-block input, not just clocks on flip-flops and RAMs. Because the reset is now on a low-skew net that can be driven by logic, you can easily synchronize it to your clock and then distribute it in a low-skew fashion to the entire design.
Lesson 5. Because the low-skew high-fanout global nets are available for general logic use and not just for clocks, the synthesis tool may detect that some signal or other has high fan-out and would benefit from being on a global net. That would seem to be a good thing, yes? For example, a design I'm finishing up now has a large mux that takes sixteen 16-bit data buses (from block RAMs) and muxes them into one 16-bit bus. The synthesis tool detected that the upper bit of the mux select had a high fan-out and put it on a global net. And it failed to meet timing, and by a ridiculous margin (something like 2.5 ns on a 100 MHz clock). (Wide muxes in the ProASIC-3 fabric are particularly ugly.) I looked at the timing analyzer to see why, and it showed the path from the counter that generated the mux select to one of the mux-output registers, and there was an oddball 6 ns (!) delay on one particular part of the path. It turns out that the tool had put a mux-select line on the global net, and getting to the global buffer (which is on the edge of the chip) required a long route. Once the signal was on the global net the delay was short, but it was the route to the buffer that killed it. I had to greatly increase the fan-out limit in the synthesis tool so that it wouldn't do that.
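Going back to Lesson 4 for a moment, the "synchronize it to your clock" part is commonly done with a two-flop reset synchronizer, something like the sketch below. The entity and port names are made up, and the active-low input pin is an assumption. The reset asserts asynchronously, so it works with flip-flops that only have async resets and doesn't depend on initializers, and it releases on a clock edge, so every flip-flop downstream comes out of reset in the same cycle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example entity: async assert, synchronous release.
entity reset_sync is
  port (
    clk      : in  std_logic;
    rst_in_n : in  std_logic;  -- raw active-low reset from any convenient pin
    rst_out  : out std_logic   -- active-high reset for the rest of the design
  );
end entity reset_sync;

architecture rtl of reset_sync is
  -- No initializer here on purpose; the async reset sets the state.
  signal sync : std_logic_vector(1 downto 0);
begin
  process (clk, rst_in_n)
  begin
    if rst_in_n = '0' then
      -- Assert immediately, with or without a running clock.
      sync <= (others => '1');
    elsif rising_edge(clk) then
      -- Shift in zeros; rst_out drops two clock edges after release.
      sync <= sync(0) & '0';
    end if;
  end process;

  rst_out <= sync(1);
end architecture rtl;
```

The rst_out signal is the one you'd expect the tool to promote to a global net, and it feeds the async reset input of everything else in that clock domain.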
So yeah, it would be great if it was reasonable to "write once, synthesize everywhere," but in practice, that isn't possible.