The gap between messages must be at least 3.5 character times. This is a well-defined, unambiguous feature that all devices on the bus can agree on, even if it's a pain to implement, because it requires both a UART and a timer, plus a device driver that coordinates the two coherently. It's OK. Difficult is OK if it means the end product is more reliable and easier to use.
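To make the numbers concrete, here's a minimal sketch of the inter-frame gap calculation (the function name is mine). One RTU character is 11 bits on the wire, and the later Modbus-over-serial-line spec fixes the gap at 1.75 ms for baud rates above 19200, since 3.5 character times gets impractically short:

```python
def modbus_t35(baud):
    """Minimum silent interval (seconds) between Modbus RTU frames."""
    if baud > 19200:
        # Spec recommendation: fixed 1.75 ms above 19200 baud.
        return 0.00175
    # One character = 11 bits: start + 8 data + parity + stop.
    return 3.5 * 11.0 / baud
```

At 9600 baud this comes out to about 4 ms of mandatory silence, which is exactly the kind of sub-tick deadline that is trivial on a bare microcontroller timer and miserable through a general-purpose OS serial stack.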
This causes colossal pain when trying to interface Modbus devices through general-purpose operating systems and UART adapters.
When actually designing a Modbus device (master or slave), with a microcontroller for example, this is all trivial, and IMHO it's a very good strategy. In 1979, they probably did not anticipate how bloated OS abstractions would become 40 years later, and how difficult it would be to perform even a few approximately timed I/O operations.
It allows registers to be combined into 32-bit ints or floats... fine. But again, it fails to specify which byte is which, so there are at least four vaguely sensible ways to transfer a simple 32-bit value from one device to another, with probing and random byte swapping being the only real way to make both master and slave agree on one of them. Why?
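The four variants are the cross product of word order and byte order. A minimal sketch (the helper name and parameters are mine, not from any spec): each register is a 16-bit quantity, nominally big-endian on the wire, and devices disagree on both which register is the high word and which byte within a register comes first.

```python
def decode_u32(hi_reg, lo_reg, word_swap=False, byte_swap=False):
    """Combine two 16-bit Modbus registers into a 32-bit value,
    in any of the four byte orders seen in the wild."""
    # Which register holds the high word is unspecified.
    regs = (lo_reg, hi_reg) if word_swap else (hi_reg, lo_reg)
    # Byte order within each register is also unspecified.
    raw = b"".join(r.to_bytes(2, "little" if byte_swap else "big") for r in regs)
    return int.from_bytes(raw, "big")
```

Feed the same two registers 0x1234 and 0x5678 through all four combinations and you get 0x12345678, 0x56781234, 0x34127856, and 0x78563412 - and nothing in the protocol tells you which one the device on the other end meant.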
It is indeed hilarious how automation electricians, who normally install wiring and so on, have to configure these test registers and probe for the correct byte order without being computer scientists or engineers at all; I have witnessed this happening. Talk about the low level leaking into higher abstractions!
And who thought it was a good idea for replies not to include any kind of context? The master always has to remember the last request issued, because responses from slave devices are just a string of numbers - there's no way to know what those numbers mean without looking back to see what the last request was. Even a unique sequence number - one byte! - would at least allow the master to confirm that the numbers being sent do actually relate to a specific numbered request. (Imagine a scenario... "M: What's our altitude?" ... <long delay> ... "M: What's our fuel level?" ... "S:<7E 29 00 01>")
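To illustrate the consequence: an RTU response frame carries only the slave address, the function code, and raw data bytes, so the only thing a master can match against is the single request it currently has in flight. A hypothetical sketch (the class and its methods are my illustration, not a real library):

```python
class RtuMaster:
    """Toy illustration of why Modbus RTU masters must poll strictly
    one-request-at-a-time: responses carry no request context."""

    def __init__(self):
        self.pending = None  # (slave_addr, function_code) of the in-flight request

    def send(self, slave_addr, function_code):
        assert self.pending is None, "RTU allows only one outstanding request"
        self.pending = (slave_addr, function_code)

    def on_response(self, frame):
        addr, func = frame[0], frame[1]
        # Address + function code is ALL the response gives us to match on.
        if self.pending != (addr, func):
            raise ValueError("response does not match the outstanding request")
        self.pending = None
        return frame[2:]  # payload whose meaning depends entirely on the request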
Modbus is an ancient protocol, implemented on ancient devices, so the idea was probably: a reply is a simple register read that comes immediately after the poll, otherwise the master times out and the slave will not reply after a certain time, so a mixup can't happen. The problem is that this idea, and the timeout value, were never specified. Had the spec mandated a maximum turn-around time (the time until the slave asserts RS485 TXEN and sends the start bit) of, say, 1 millisecond, no such problem would exist, and everything would be simple - no need to add a context ID and compare it. But as you say, they did not feel like specifying pretty much anything.
I can totally see engineers at Schneider or whoever it was trying to write a proper specification and then manager types going like... "no, that is too verbose, remove that, no, don't say that, that doesn't sound cool!"