1) Chalk up another vote against HAL. It's convoluted code aimed at achieving ... I don't know what. It certainly doesn't look like it would be much help in implementing the "standard" UART APIs that I'm used to (like a ring buffer). It sorta looks like the _IT (non-blocking, interrupt-driven) functions are a copy of the DMA functions. I guess that's OK, except that DMA is problematic for sporadic devices like UARTs, and I'd rather the DMA functions worked hard to duplicate the functionality of the non-DMA functions.
2) In any case, using Receive_IT with a count of 1 seems like a bad idea; that means swapping buffer pointers around, disabling and enabling interrupts, and a bunch of other overhead for EVERY character received.
3) Calling the blocking Transmit() function from an ISR callback seems like a bad idea. If the UART HW buffer is full, the transmit could end up taking a full character time (with rx interrupts turned off). I can't quite see how this would explain the behavior you're seeing for the test case you describe, but I can see similar things happening...
4) Normally I'd do echoing at user/process level (or perhaps "tty driver" level) rather than in the receive interrupt (or callback). That's largely because of reentrancy issues (see (3)), but also because I'll normally want echoing to be smarter (visual rubout, not echoing control characters, etc.). Since you're currently "reading" the characters just to echo them, I don't see how you'll be able to do something more useful with them as well.
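
For what it's worth, here's roughly the kind of thing I mean in (1) and (4). It's a minimal sketch, not HAL code: the receive ISR does nothing but push the byte into a ring buffer, and the main loop pulls bytes out and decides what to echo. All the names (rx_ring_t, rx_ring_put, rx_ring_get, echo_bytes, RX_BUF_SIZE) are made up for illustration, and it assumes a single-core MCU with exactly one producer (the ISR) and one consumer (the main loop), where a 16-bit index write is atomic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RX_BUF_SIZE 64u  /* hypothetical size; must be a power of two */
static_assert((RX_BUF_SIZE & (RX_BUF_SIZE - 1u)) == 0, "size must be 2^n");

typedef struct {
    volatile uint16_t head;      /* next write slot; written only by the ISR */
    volatile uint16_t tail;      /* next read slot; written only by main loop */
    uint8_t data[RX_BUF_SIZE];   /* holds at most RX_BUF_SIZE - 1 bytes */
} rx_ring_t;

/* Producer side: call from the receive ISR with the byte just read from
 * the UART data register. Returns false (byte dropped) if the ring is full. */
static bool rx_ring_put(rx_ring_t *rb, uint8_t byte)
{
    uint16_t next = (uint16_t)((rb->head + 1u) & (RX_BUF_SIZE - 1u));
    if (next == rb->tail) {
        return false;            /* full: one slot is always left empty */
    }
    rb->data[rb->head] = byte;
    rb->head = next;             /* publish only after the data is stored */
    return true;
}

/* Consumer side: call from the main loop. Returns false if nothing waiting. */
static bool rx_ring_get(rx_ring_t *rb, uint8_t *out)
{
    if (rb->tail == rb->head) {
        return false;            /* empty */
    }
    *out = rb->data[rb->tail];
    rb->tail = (uint16_t)((rb->tail + 1u) & (RX_BUF_SIZE - 1u));
    return true;
}

/* Echo policy, run at user level rather than in the ISR. Given one received
 * byte, fills out[] with the bytes to send back and returns how many (0..3).
 * *line_len tracks how many printable characters are on the current line. */
static size_t echo_bytes(uint8_t c, size_t *line_len, uint8_t out[3])
{
    if (c == 0x08 || c == 0x7F) {        /* backspace or DEL: visual rubout */
        if (*line_len == 0) {
            return 0;                    /* nothing left to erase */
        }
        (*line_len)--;
        out[0] = '\b'; out[1] = ' '; out[2] = '\b';
        return 3;
    }
    if (c == '\r') {                     /* end of line: echo CR LF */
        *line_len = 0;
        out[0] = '\r'; out[1] = '\n';
        return 2;
    }
    if (c < 0x20) {
        return 0;                        /* other control chars: no echo */
    }
    (*line_len)++;
    out[0] = c;                          /* ordinary character: echo as-is */
    return 1;
}
```

The ISR would call rx_ring_put() with the received byte and return immediately. The main loop would call rx_ring_get(), hand the byte to echo_bytes(), and queue whatever comes back for transmit, so nothing ever blocks inside the interrupt and the characters are still available for real processing afterwards.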