A statement that has been repeated here many times is that such processors waste flash, because the limitations of the architecture lead to bloated programs. They do lead to bloated programs - but some people seem to be unaware that the "terrible 3 cent microcontroller" does NOT HAVE any flash. Those devices are one-time-programmable. When you program them with high voltage, internal fuses are burnt, and you cannot erase or re-program them.
That's irrelevant to the principle of the discussion. Storage for the program costs money, regardless of the particular form it takes -- OTP, mask ROM, EPROM, flash, SRAM. The ratio of the cost of a bit of program space to a gate in the CPU core varies, but probably not by all that much -- and flash is probably the cheapest.
Regardless of the relative costs, above some program size the cost of an inefficient program encoding outweighs the cost of making a better CPU core. That point might be at 1,000 instructions, at 10,000 instructions, or at 100,000 instructions.
I would be surprised if it lies outside that range.
Especially when we're talking about the difference between, say, {8080, 6502, 8051, PIC} and low-complexity implementations of, say, {PDP-11, MSP430, CM0, RV32E}.
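To put that break-even point in a formula (my own framing, and the numbers in the example are purely illustrative, not taken from any real core): the denser encoding pays for itself once the program storage it saves costs more than the extra logic in the more capable core.

```python
# Rough break-even model -- my framing, illustrative numbers only.
# A denser ISA shrinks the program by `density_gain` (e.g. 0.25 = 25% smaller);
# the more capable core costs `extra_core_gates` more, at `cost_per_gate`,
# while program storage costs `cost_per_byte` per byte.

def break_even_insns(extra_core_gates, cost_per_gate, cost_per_byte,
                     density_gain, bytes_per_insn=4):
    """Program size (in instructions, counted on the simpler core) at which
    the denser encoding's storage savings pay for the extra core logic."""
    extra_core_cost = extra_core_gates * cost_per_gate
    saved_per_insn  = bytes_per_insn * density_gain * cost_per_byte
    return extra_core_cost / saved_per_insn

# Purely hypothetical example: the better core costs 5000 gates more, a gate
# costs about the same as a byte of program storage, encoding is 30% denser.
print(break_even_insns(5000, cost_per_gate=1.0, cost_per_byte=1.0,
                       density_gain=0.30))   # ~4167 instructions
```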
I've been trying to understand at what point it becomes worth adding the "C" extension for 16-bit opcodes to a RISC-V core. The 16-bit to 32-bit decoder takes something like 400 6-LUTs and saves 25%-30% of program size. As a crude estimate, if you used those LUTs as program memory instead (64 bits per LUT), that's about 3 KB of RAM-equivalent. To save 3 KB of program size you'd need the RV32I program to be 10 KB to 12 KB, or 2500 to 3000 instructions.
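For what it's worth, here's that arithmetic spelled out (my own back-of-envelope check, assuming a 6-LUT used as distributed RAM stores 64 bits; the variable names are just for illustration):

```python
# Sanity check of the estimate above (my arithmetic, assuming a 6-LUT
# used as distributed RAM stores 64 bits).
decoder_luts = 400
ram_equiv_bytes = decoder_luts * 64 / 8          # 3200 bytes, "about 3 KB"

for savings in (0.25, 0.30):
    break_even_bytes = ram_equiv_bytes / savings  # RV32I program size at which
    break_even_insns = break_even_bytes / 4       # the C decoder breaks even
    print(f"{savings:.0%} denser: {break_even_bytes/1024:.1f} KB, "
          f"~{break_even_insns:.0f} RV32I instructions")

# 25% denser: 12.5 KB, ~3200 RV32I instructions
# 30% denser: 10.4 KB, ~2667 RV32I instructions
```

That lands roughly on the 2500-3000 figure above; the small difference is only whether you call 3 KB 3072 or 3200 bytes.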