If someone can find a specific example where directly instantiated primitives produce a demonstrably superior result, then I'd like to see it.
I don't think I would ever be able to loosen your dogmatic view that no benefit can be had by using primitives. However, here are some things that might help:
- A 31-page overview of the PicoBlaze/KCPSM design, written by the designer, is available at
http://bleyer.org/pacoblaze/picoblaze.pdf. It discusses the trade-off between architecture and resources, and may show the designer's original intention. The design was not supposed to be portable; it was supposed to take maximum advantage of a given FPGA architecture.
- The design is Xilinx IP, and is restricted by its license to use on Xilinx devices only, so discussing portability to other vendors' FPGAs is a straw man. You are not legally allowed to do this.
- Synthesis tools are ignorant of the H/W layer, and like to do things such as replacing chains of flip-flops with LUT-implemented shift registers, undoing the designer's intent of synchronizing signals. Or, if they do keep the registers as FFs, they do not place them as close together as possible, which lowers MTBF, unless you use somewhat hit-and-miss vendor-specific synthesis attributes.
- When memory blocks are inferred you don't know what they are called, so you can't use tools such as "memtool.exe" to replace memory contents without rebuilding the entire bitstream. As a workaround you can sometimes look at the final design and infer what the memory block was called but that is somewhat of a hack compared with just using a primitive.
- When you infer a 1024x18-bit memory that is initially all zeros it will be optimized away, so there will be no memory block you could push any desired contents into.
- A smaller table of filter constants may end up implemented in LUTs, limiting the performance of DSP blocks because the tools can no longer use dedicated routing paths. (Unless, of course, you once again resort to vendor-specific synthesis attributes.)
- You may explicitly want a chain of flip-flops to be flip-flops, so they can be placed across the die to help with timing. This can have you reaching for vendor-specific compiler settings and, once again, vendor-specific synthesis attributes.
- There are perfectly functional designs that can be implemented using primitives and/or manual placement that cannot be correctly inferred. As an extreme example, how about "A 7.4 ps FPGA-Based TDC with a 1024-Unit Measurement Matrix"
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5424742/....
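To put a rough number on the synchronizer point above: the usual first-order estimate is MTBF = exp(t_r / tau) / (T0 * f_clk * f_data), where t_r is the time the first flip-flop has to resolve before the second one samples it. The sketch below uses that formula; the constants tau and T0 and both frequencies are made-up illustrative values, not figures for any real FPGA family.

```python
import math

def synchronizer_mtbf(t_resolve_s, tau_s, t0_s, f_clk_hz, f_data_hz):
    """First-order MTBF estimate (seconds) for a two-flop synchronizer."""
    return math.exp(t_resolve_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)

# Hypothetical device constants and rates, chosen only for illustration.
TAU = 50e-12      # metastability time constant (assumed)
T0 = 100e-12      # metastability window (assumed)
F_CLK = 200e6     # destination clock frequency (assumed)
F_DATA = 10e6     # rate of asynchronous input transitions (assumed)

# Same circuit, two placements: flops adjacent vs. 1 ns lost to routing.
mtbf_close = synchronizer_mtbf(4.0e-9, TAU, T0, F_CLK, F_DATA)
mtbf_far = synchronizer_mtbf(3.0e-9, TAU, T0, F_CLK, F_DATA)

print(f"MTBF ratio (close / far): {mtbf_close / mtbf_far:.3g}")
```

Because the resolution time sits in the exponent, every nanosecond of routing delay between the two flops costs a factor of exp(1 ns / tau) in MTBF, which is why placement matters and not just the netlist.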
And just in case you say "but these can all be addressed with synthesis attributes and implementation options": that leaves you in a worse position than using primitives. Your design now depends on fine-level details of your build environment and on the correct incantation, which will differ between vendors, tools and even tool versions.
Why not just build a small library of behavioral descriptions of the missing Xilinx primitives? Then you can build the design for any vendor that lacks them. They are simple enough to model, and if you truly believe what you assert, the result should be the same and still optimal.
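To show how small such a behavioral description is, here is a minimal sketch of a 4-input LUT with an INIT value. It is written in Python as a simulation stand-in; a real portability library would be written in the project's HDL. The class and method names are my own, and which primitives the design actually needs is not something I have checked.

```python
class Lut4Model:
    """Behavioral model of a 4-input LUT: output = INIT[{i3, i2, i1, i0}]."""

    def __init__(self, init):
        # INIT is a 16-bit truth table; bit n is the output for input index n.
        self.init = init & 0xFFFF

    def out(self, i0, i1, i2, i3):
        # Form the 4-bit index with i3 as the most significant bit.
        index = ((i3 & 1) << 3) | ((i2 & 1) << 2) | ((i1 & 1) << 1) | (i0 & 1)
        return (self.init >> index) & 1

# INIT = 0x8000 gives a 4-input AND; INIT = 0x6996 gives a 4-input XOR.
and4 = Lut4Model(0x8000)
xor4 = Lut4Model(0x6996)

print(and4.out(1, 1, 1, 1), xor4.out(1, 0, 0, 0))
```

The model is only a truth-table lookup, which is the point: the function is trivial to describe, and the open question is what each vendor's tools then do with it.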