The baud clock is too far off to be reliable.
First, the code in your first post always rounds down in the prescaler division, and this causes too large an error. Either hard-code the result or fix the formula by adding half the divisor before dividing, i.e. #define UBRR_VALUE ((((F_CPU + (USART_BAUDRATE * 8UL)) / (USART_BAUDRATE * 16UL))) - 1). This small off-by-one gets worse as the baud rate is raised and the prescaler value shrinks.
The real problem is that the clock rate is not an integer multiple of the baud rate*16. Looking at the AVR datasheet, the error at UBRRn=8 is -3.5%, and this is too large to work reliably. How much error is tolerable depends on the actual UART block used on both sides; better UARTs perform clock recovery and are able to cope with clock rate errors.
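The rounding and the residual error can be checked on a PC before touching the hardware. This is only a sketch: the helper names are made up for illustration, and it assumes normal mode (U2X=0), where the baud rate is F_CPU / (16 * (UBRR + 1)).

```c
#include <stdint.h>

/* UBRR with plain integer division: always rounds down. */
static uint32_t ubrr_truncated(uint32_t f_cpu, uint32_t baud)
{
    return f_cpu / (16UL * baud) - 1;
}

/* UBRR rounded to nearest: add half the divisor before dividing. */
static uint32_t ubrr_rounded(uint32_t f_cpu, uint32_t baud)
{
    return (f_cpu + 8UL * baud) / (16UL * baud) - 1;
}

/* Baud rate the hardware actually produces for a given UBRR (U2X = 0). */
static double actual_baud(uint32_t f_cpu, uint32_t ubrr)
{
    return (double)f_cpu / (16.0 * (ubrr + 1));
}

/* Signed error of the actual rate relative to the requested one, in percent. */
static double baud_error_percent(double actual, uint32_t requested)
{
    return (actual / requested - 1.0) * 100.0;
}
```

For F_CPU = 16000000 and 115200 baud, the truncating version gives UBRR = 7 (125000 baud, about +8.5%), while the rounded version gives UBRR = 8 (about 111111 baud, roughly -3.5%).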
In case of simple UARTs:
* you have start bit + 8 data bits + stop bit = 10 bits
* simple UARTs sample the receive pin in the middle of each bit: you do not want more than half a bit of clock drift by the end of the tenth bit, or you will sample the wrong bit
* thus the clock may be off by roughly +-5% (half a bit spread over ten bits).
More advanced UARTs sample each bit multiple times (e.g. the AVR samples 3 times); then the margin gets narrower. The AVR datasheet specifies the worst case as +-4.5%. Either way, you must take into account that the other end may be off in the opposite direction, and then things still break. Unless both ends are your own hardware, the rule of thumb is to use no more than half of that tolerance and leave the other half to the other device. (If both ends are your own hardware, you can use whatever non-standard baud rate works out well, and it is a non-issue anyway.) So when going above 2% of error, prepare for trouble.
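The budget argument above can be put into numbers. This is a rough sketch of the simplified model only (one mid-bit sample, half a bit of allowed drift over the frame); both function names are hypothetical, and real receivers such as the AVR have tighter limits, as noted.

```c
/* Largest clock mismatch a mid-bit-sampling receiver survives, in percent:
   half a bit of drift spread over the whole frame. */
static double simple_uart_tolerance_percent(unsigned frame_bits)
{
    return 0.5 / frame_bits * 100.0;
}

/* Relative rate mismatch seen by the receiver when the two ends are off
   from the nominal baud rate by the given signed percentages. */
static double link_mismatch_percent(double tx_error_percent, double rx_error_percent)
{
    return ((100.0 + tx_error_percent) / (100.0 + rx_error_percent) - 1.0) * 100.0;
}
```

With a 10-bit frame the first function gives about 5%. An end that is 3.5% slow talking to a peer that is 1.5% fast gives a mismatch of about -4.9%, which leaves almost no margin even under this optimistic model.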
An AVR UART at 115200 baud from a 16MHz oscillator has a 3.5% error. That is far too much. You may have better luck with U2X=1; then the baud rate is only 2.1% off, but the receiver uses 8 samples per bit instead of 16, so its three sampling points cover a wider part of each bit and it is less tolerant of error (though the error itself is smaller). TX is definitely better; RX may be better or worse. If this is a TX-only debug UART, then this would be a solution.
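To compare the two modes, the same calculation can be written with the sample count as a parameter. Again this is a host-side sketch with made-up helper names; it assumes the usual AVR relation baud = F_CPU / (N * (UBRR + 1)), with N = 16 in normal mode and N = 8 when U2X is set.

```c
#include <stdint.h>

/* Samples per bit: 16 in normal mode, 8 when U2X is set. */
static uint32_t samples_per_bit(int u2x)
{
    return u2x ? 8UL : 16UL;
}

/* Nearest-integer UBRR for the requested baud rate. */
static uint32_t ubrr_nearest(uint32_t f_cpu, uint32_t baud, int u2x)
{
    uint32_t div = samples_per_bit(u2x) * baud;
    return (f_cpu + div / 2) / div - 1;
}

/* Signed baud rate error in percent for a given UBRR setting. */
static double ubrr_error_percent(uint32_t f_cpu, uint32_t baud, int u2x, uint32_t ubrr)
{
    double actual = (double)f_cpu / ((double)samples_per_bit(u2x) * (ubrr + 1));
    return (actual / baud - 1.0) * 100.0;
}
```

At 16MHz and 115200 baud this yields UBRR = 8 with about -3.5% error for U2X=0, and UBRR = 16 with about +2.1% error for U2X=1.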
A better approach would be either:
* Swap the crystal on the board for one that is an integer multiple of 115200*16. For example, 18.4320 or 14.7456MHz.
* Use a better baud rate. If you do not have too much data and can accept the reduced transfer rate, go to a lower baud rate with less error (see the table in the datasheet). If not, go to 250 or 500kbaud, which divide evenly from 16MHz. You may have issues at those rates with some UARTs or USB-serial adapters; some do not work well above 230kbaud.
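Whether a crystal and baud rate combination works out exactly is easy to test. A small sketch with hypothetical helper names, assuming normal mode (U2X=0), where an exact match needs F_CPU to be an integer multiple of 16 * baud:

```c
#include <stdint.h>

/* Returns 1 when f_cpu / (16 * baud) is an integer, i.e. some UBRR value
   hits the baud rate with zero error in normal (U2X = 0) mode. */
static int baud_is_exact(uint32_t f_cpu, uint32_t baud)
{
    return f_cpu % (16UL * baud) == 0;
}

/* UBRR for an exact match; only meaningful when baud_is_exact() is true. */
static uint32_t exact_ubrr(uint32_t f_cpu, uint32_t baud)
{
    return f_cpu / (16UL * baud) - 1;
}
```

This confirms the options above: 18.4320MHz and 14.7456MHz hit 115200 exactly (UBRR = 9 and 7), and 16MHz hits 250 and 500kbaud exactly (UBRR = 3 and 1), while 16MHz at 115200 does not divide evenly.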
If the UART is hard-wired to a known chip, check the datasheet of that end as well!