There is a one-cycle difference in latency between the 'pa_out' signal and 's_lut_out', because the lookup table's output is registered.
lut_complementer #(
    .N(17)
) C2 (
    .clk(clk), .rst(rst), .msb_in(pa_out[11]),
    .comp_in(s_lut_out), .comp_out(s_out)
);
You want to delay pa_out[11] by one cycle, so that the bit arriving at msb_in belongs to the same sample as s_lut_out.
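A minimal sketch of that delay, meant to sit in the same module as the C2 instance. The register name msb_d is made up for illustration, and the synchronous, active-high reset is an assumption; match it to whatever your design actually uses.

```verilog
// Hypothetical one-cycle delay for pa_out[11]; 'msb_d' is an assumed name.
reg msb_d;

always @(posedge clk) begin
    if (rst)
        msb_d <= 1'b0;        // assumption: synchronous, active-high reset
    else
        msb_d <= pa_out[11];  // msb_d now lags pa_out[11] by one clock
end

// Same instance as above, but fed with the delayed bit so that
// msb_in and comp_in (s_lut_out) refer to the same sample.
lut_complementer #(
    .N(17)
) C2 (
    .clk(clk), .rst(rst), .msb_in(msb_d),
    .comp_in(s_lut_out), .comp_out(s_out)
);
```

If the lookup table ever gets a second pipeline stage, the delay on pa_out[11] would need a second register as well, so it may be worth checking the alignment in simulation.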