Possibly dumb question.... Did you consider in the original code that:
assign fifo_full = (fifo_size >= 3'd3);
Could be:
assign fifo_full = (fifo_size >= 3'd3) || shift_out;
...as when the consumer is pulling data from the FIFO you should always be able to insert a new item?
Also, it occurred to me that with such a small FIFO you could consider just a 6-input lookup table finite state machine - four bits of state with the shift_in and shift_out signals indexing a table of the next state (so 4 x LUT6s) and seven or so LUT4s to MUX control signals and outputs, along with the regs to hold the data? With a little bit of care in selecting the state so that two of the bits corresponded with the read index it seems could be implemented in two levels of logic on a LUT6 architecture.
Correction:
assign fifo_full = (fifo_size >= 3'd3) && ~shift_out;
Thanks. I completely missed that. With that 1 extra word in my fifo, I was expecting the module feeding the fifo to have a delayed response to the full flag, however, you are absolutely right.
Will be added to V3.
As for your second half, my V3 single 2 input mux for the data_out which will be the fastest possible. All other regs have solely a 1 or 2 input source control while the 4/8word array will be left to the compiler how it feeds my new 1 register before that final output 2:1 AB mux. The output flags will be a single logic cell, no additional fanout other than the fifo_not_empty being a 2 input OR gate, 1 input coming again from a single logic cell with no additional fanout (now for fifo_full as well).
I'm writing it in a way which it still appears like the fifo memory is a normal array, but separated that array by an additional register like what you see in dual-port rams where you get a performance boost by registering the data outs.
The final 2:1 AB mux will only have 1 control registered logic cell with no additional fanouts. 1 side of the mux, connected to my fifo memories new register output will have no additional fanouts on that reg's data output. The only thing I cannot control is where I'm receiving the data_in from the user who is feeding me.
The end goal is to eliminate every bottleneck possible. Just drop in my fifo and forget as there would be no faster way to get that 0 clk first word fall through capability without any determent to the user's existing design.
As for the finite state machine, if your logic has few enough bits that it will fit, the compiler does coverts your logic to a state machine if will improve design efficiency and it can.