A neat solution would be to turn ESP8266 / ESP32 into a smart network processor with JSON-RPC API over UART.
I suppose it comes down to where you put that line. Putting it as "Network / Non-Network" makes sense.
The network stack on the ESP8266 via the AT modem mode over UART is very basic. I need to research more, but it seems that to open a TCP connection you send the AT commands for it as text. How it then connects the IO streams I don't know yet. I know there are AT commands to "send n bytes" and probably "recv n bytes", but how do you then deal with multiple TCP connections? A lot to learn.
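From a first skim of Espressif's AT command set, multiple connections seem to be handled by switching on multi-connection mode (AT+CIPMUX=1) and then tagging every command with a link ID; incoming data arrives as "+IPD,<link ID>,<len>:" followed by the payload. Here is a rough C sketch of what the STM32 side ends up formatting and parsing. The helper names are my own invention, and the exact +IPD layout can vary with firmware version and settings, so treat it as an illustration rather than a reference.

```c
#include <stdio.h>
#include <stddef.h>

/* Format the command that opens TCP link `link_id` (multi-connection mode,
 * i.e. after AT+CIPMUX=1). Returns the length written, or -1 if buf is too small.
 * Helper names here are hypothetical, not part of any Espressif library. */
static int at_format_open(char *buf, size_t cap, int link_id,
                          const char *host, int port)
{
    int n = snprintf(buf, cap, "AT+CIPSTART=%d,\"TCP\",\"%s\",%d\r\n",
                     link_id, host, port);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Format the "send `length` bytes on link `link_id`" command. The module
 * answers with a '>' prompt before it accepts the raw payload bytes. */
static int at_format_send(char *buf, size_t cap, int link_id, int length)
{
    int n = snprintf(buf, cap, "AT+CIPSEND=%d,%d\r\n", link_id, length);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Parse a "+IPD,<link>,<len>:" header. On success stores the link ID and
 * payload length and returns a pointer to the first payload byte;
 * returns NULL if the text is not an +IPD header of that shape. */
static const char *at_parse_ipd(const char *text, int *link_id, int *length)
{
    int consumed = 0;
    if (sscanf(text, "+IPD,%d,%d:%n", link_id, length, &consumed) != 2
        || consumed == 0)
        return NULL;
    return text + consumed;
}
```

Even in this tiny form you can see the problem: every connection's traffic is interleaved on one UART and the STM32 has to demultiplex it by link ID.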
Anyway. I didn't like the example code people were showing/using. Their approach was to use that AT mode over UART to treat the ESP solely as a "dumb" Wifi module. This meant the STM32 code contained all those horrible AT strings being transmitted over UART2. Of course, most of the examples shown are simple request/response, like an HTTP GET. That's such an easy state machine. I expect that if you want to consume bidirectional data, such as MQTT over a TCP socket, and respond to events without underrunning your buffers, it's going to be a LOT harder to manage at that level.
So, as you suggest, I move that line further and put ALL network-related functionality on the ESP. In my case it can start with just Wifi + TCP connection + MQTT. The STM32 doesn't even need to know the Wifi or MQTT credentials: the ESP can bring itself up as an AP and be configured from a web browser, like Tasmota. Once configured, it will reboot in "slave mode", establish a UART connection with the STM32 and set up IO buffers and DMA channels with interrupts, such that JSON messages appear in the inbound buffer and "commands" and "outbound messages" are placed in the outbound buffer.
Just need to devise that command set and figure out all the buffers/channels/interrupts.
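To make that a bit more concrete, here is a minimal C sketch of one way the framing could work: each message is a single-line JSON object terminated by a newline, loosely in the JSON-RPC style (id / method / params). Everything here is an assumption on my part, including the field names, the 256-byte limit and the function names. It does no JSON escaping, so topics must not contain quotes or backslashes.

```c
#include <stdio.h>
#include <stddef.h>

#define FRAME_MAX 256  /* assumed maximum frame size, including terminator */

/* Build one outbound command frame: a one-line JSON object ending in '\n'.
 * `topic` may be NULL for commands without parameters.
 * Returns the length written, or -1 if buf is too small. */
static int frame_command(char *buf, size_t cap, unsigned id,
                         const char *method, const char *topic)
{
    int n = topic
        ? snprintf(buf, cap,
                   "{\"id\":%u,\"method\":\"%s\",\"params\":{\"topic\":\"%s\"}}\n",
                   id, method, topic)
        : snprintf(buf, cap, "{\"id\":%u,\"method\":\"%s\"}\n", id, method);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Receive side: collects bytes (from a UART RX interrupt or a DMA ring
 * buffer, say) until a newline completes the frame. */
typedef struct {
    char   data[FRAME_MAX];
    size_t len;
    int    overflow;
} frame_rx_t;

/* Feed one received byte. Returns 1 when rx->data holds a complete,
 * NUL-terminated frame (valid until the next call), otherwise 0.
 * Frames longer than FRAME_MAX - 1 bytes are silently dropped. */
static int frame_rx_feed(frame_rx_t *rx, char byte)
{
    if (byte == '\n') {
        int complete = !rx->overflow && rx->len > 0;
        rx->data[rx->len] = '\0';
        rx->len = 0;
        rx->overflow = 0;
        return complete;
    }
    if (rx->len < FRAME_MAX - 1)
        rx->data[rx->len++] = byte;
    else
        rx->overflow = 1;
    return 0;
}
```

Newline framing keeps the receive interrupt trivial: it only has to spot one delimiter byte, and the JSON parsing can happen later in the main loop.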
A use case might be:
Connect Serial - handshake (versions)
Send CONNECT.
Wait.... on status.
Send SUBSCRIBE home/sensors/temperature/office
Wait.... on the interrupt for the receive channel.
Consume the buffer until it ends, parse the JSON, update the display.
Repeat.
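On the STM32 side, that sequence boils down to a very small state machine. A rough C sketch follows; the state and event names and the plain-text command strings are placeholders of mine, not a defined protocol, and the UART transport is left out entirely.

```c
#include <stddef.h>

/* Hypothetical states and events for the use case above. */
typedef enum { ST_HANDSHAKE, ST_CONNECTING, ST_LISTENING, ST_FAILED } link_state_t;
typedef enum { EV_VERSION_OK, EV_CONNECTED, EV_MESSAGE, EV_ERROR } link_event_t;

/* Advance the link by one event. On return *tx points at the command to
 * send next, or is NULL if there is nothing to send. */
static link_state_t link_step(link_state_t st, link_event_t ev, const char **tx)
{
    *tx = NULL;
    if (ev == EV_ERROR)
        return ST_FAILED;

    switch (st) {
    case ST_HANDSHAKE:
        /* Versions agreed over serial: ask the ESP to connect. */
        if (ev == EV_VERSION_OK) {
            *tx = "CONNECT\n";
            return ST_CONNECTING;
        }
        break;
    case ST_CONNECTING:
        /* Status says connected: subscribe, then wait for messages. */
        if (ev == EV_CONNECTED) {
            *tx = "SUBSCRIBE home/sensors/temperature/office\n";
            return ST_LISTENING;
        }
        break;
    case ST_LISTENING:
        /* EV_MESSAGE: the caller parses the JSON and updates the display;
         * the state does not change, which is the "Repeat" step. */
        break;
    case ST_FAILED:
        break;
    }
    return st;
}
```

The receive interrupt would only need to post EV_MESSAGE once a full buffer has arrived; nothing in here knows anything about Wifi, TCP or MQTT.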
So the only things the STM32 needs to know are the topics and the JSON message format, which are application specific and thus in the right place, away from the network.
Other commands and services can be added later. Maybe for BLE.
And yes, you are right: the ESP part becomes a "fungible part", reused out-of-the-box in many similar projects.