Tellurium, I simply do not believe those numbers are meaningful.
Something else is going on.
My target is doing 1 MB/sec on file transfers, and that is with a very sub-optimal low-level input implementation whereby an RTOS task polls for input:
/**
 * This function is the ethernetif_input task. It uses the function low_level_input()
 * which handles the actual reception of bytes from the network interface.
 *
 * This is a standalone RTOS task, so it is a forever loop.
 * It could be done with interrupts, but then we risk hanging the xxx under fast input
 * (unlikely if the ETH is via a switch, but you could have a faulty LAN with lots of
 * broadcasts), plus we have the issue of detecting link status changes in a thread-safe way.
 */
void ethernetif_input( void * argument )
{
    struct pbuf *p;
    struct netif *netif = (struct netif *) argument;
    uint32_t link_change_check_count = ETH_LINK_CHANGE_COUNT;

    // Define RX activity timer, for dropping fast poll down to slow poll
    TimerHandle_t rxactive_timer = xTimerCreate("ETH RX active timer",
            pdMS_TO_TICKS(ETH_SLOW_POLL_DELAY), pdFALSE, NULL, RXactiveTimerCallback);

    // Start "rx active" timer; the 20 is just how long to wait on the timer command queue
    xTimerStart(rxactive_timer, 20);

    do
    {
        p = low_level_input( netif );   // This sets rxactive=true if it sees data
        if (p != NULL)
        {
            if (netif->input( p, netif ) != ERR_OK)
            {
                pbuf_free(p);
            }
        }

        if (rxactive)
        {
            rxactive = false;
            // Seen rx data - reload the "rx active" timeout (ETH_SLOW_POLL_DELAY)
            xTimerReset(rxactive_timer, 20);
            // and make the osDelay below run fast
            rx_poll_period = ETH_RX_FAST_POLL_INTERVAL;
        }

        // This has a dramatic effect on ETH speed, both ways (TCP/IP acks all packets)
        osDelay(rx_poll_period);

        // Do ETH link status change check
        link_change_check_count--;
        if (link_change_check_count == 0)
        {
            // Reload counter
            link_change_check_count = ETH_LINK_CHANGE_COUNT;
            // Get most recently recorded link status
            bool net_up = netif_is_link_up(&g_xxx_netconf_netif);
            // Read the physical link status
            ethernetif_set_link(&g_xxx_netconf_netif);
            // Has the link status changed?
            if (net_up != netif_is_link_up(&g_xxx_netconf_netif))
            {
                ethernetif_update_config(&g_xxx_netconf_netif);
                if (net_up)
                {
                    // Link was up so must have dropped
                    debug_thread_printf("Ethernet link down");
                }
                else
                {
                    // Link was down so must be up - restart DHCP
                    debug_thread_printf("Ethernet link up");
                    xxx_network_restart_DHCP();
                }
            }
        }
    } while (true);
}
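For completeness, the timer callback just drops the poll rate back down. A minimal sketch of what it has to do (the actual callback isn't shown above, and ETH_RX_SLOW_POLL_INTERVAL is a made-up name here - only the fast interval appears in the task):

// Sketch of the "rx active" timer callback - assumed, not the actual code.
// ETH_RX_SLOW_POLL_INTERVAL is a hypothetical name for the slow poll period.
static void RXactiveTimerCallback( TimerHandle_t xTimer )
{
    // No RX data seen for ETH_SLOW_POLL_DELAY ms, so drop back to slow
    // polling to save CPU. The next packet sets rxactive, which makes the
    // task above reset this timer and go back to fast polling.
    rx_poll_period = ETH_RX_SLOW_POLL_INTERVAL;
}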
There simply aren't several seconds' worth of internet transactions involved in the TLS setup. The CPU time is spent firmly in the RSA or EC crypto. So your 10x difference could well be RSA versus EC.
This is a very old issue, which is why there are chips which do RSA in hardware, in ~100 ms. If your target could do it in 400 ms, then a modern 500 MHz arm32 would do RSA in 100 ms in software, and EC much faster, and nobody would be using those coprocessor chips (well, except for "secure" private key storage).
Can you supply actual numbers for your RSA and EC code, and at what MHz?
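If it helps, this is roughly how I'd collect such numbers on the target - a minimal sketch wrapping just the handshake (which is where the RSA/EC maths runs), assuming FreeRTOS and an already-configured mbedtls_ssl_context; the function name is illustrative:

#include "FreeRTOS.h"
#include "task.h"
#include "mbedtls/ssl.h"

/* Time just the handshake phase. "ssl" must already be set up
 * (config, BIO callbacks, hostname); this only adds the measurement. */
static void measure_tls_handshake(mbedtls_ssl_context *ssl)
{
    TickType_t t0 = xTaskGetTickCount();
    int ret;
    while ((ret = mbedtls_ssl_handshake(ssl)) != 0)
    {
        if (ret != MBEDTLS_ERR_SSL_WANT_READ &&
            ret != MBEDTLS_ERR_SSL_WANT_WRITE)
            break;  /* genuine failure - bail out */
    }
    TickType_t t1 = xTaskGetTickCount();
    debug_thread_printf("TLS handshake took %lu ms (ret=%d)",
        (unsigned long)((t1 - t0) * portTICK_PERIOD_MS), ret);
}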
Other apps I have running that use sockets show no performance issues. For example, an MbedTLS application (an HTTPS client) does a file download at a decent speed. I personally use the Netconn API of LWIP, which sits below the socket layer, but still...
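For anyone unfamiliar with it, the Netconn API looks like this - a minimal TCP client sketch (IPv4-only LWIP config assumed, placeholder address; not code from my project):

#include "lwip/api.h"

/* Fetch something over plain TCP via the Netconn API - one level below
 * the socket layer, no socket FDs involved. */
static void netconn_demo(void)
{
    ip_addr_t ip;
    IP4_ADDR(&ip, 192, 168, 1, 10);             /* placeholder server */
    struct netconn *conn = netconn_new(NETCONN_TCP);
    if (conn == NULL) return;
    if (netconn_connect(conn, &ip, 80) == ERR_OK)
    {
        const char req[] = "GET / HTTP/1.0\r\n\r\n";
        netconn_write(conn, req, sizeof(req) - 1, NETCONN_COPY);
        struct netbuf *buf;
        while (netconn_recv(conn, &buf) == ERR_OK)
        {
            void *data;
            u16_t len;
            do {
                netbuf_data(buf, &data, &len);
                /* consume len bytes at data here */
            } while (netbuf_next(buf) >= 0);    /* walk chained buffers */
            netbuf_delete(buf);
        }
    }
    netconn_close(conn);
    netconn_delete(conn);
}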
The STM32 ethernet driver implementation is famous for its bloat and low quality. It is infested with bugs which ST has not managed to fix for years.
Where exactly is the "bloat and low quality"? The 32F4 ETH subsystem runs a linked list of DMA packet descriptors (I've not studied it in detail; it is very complicated and I have better things to do). The code I posted interfaces between the LWIP pbuf structure and this descriptor list. It doesn't look like it can be made much shorter.
There are zero-copy versions posted on the ST forum, from "Piranha", but I am not sure anybody understands them, and he appears to have vanished. I too was concerned about the data copying, so I did some timing measurements with a scope, and the software copy my (ST) code uses takes an insignificant time. It's hard to see where much time can be saved; replacing the memcpy() with DMA would make no difference in any likely IOT application.
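If you want to repeat that measurement without a scope, the Cortex-M cycle counter does the job - a minimal sketch assuming a standard CMSIS setup on the 32F4 (the frame size and clock figure are illustrative):

#include "core_cm4.h"   /* CMSIS - DWT cycle counter on Cortex-M4 */
#include <string.h>

/* One-off enable of the DWT cycle counter */
static void cyccnt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;
}

/* Time a full-frame copy; 1514 bytes is a maximum ethernet frame.
 * At 168 MHz this comes out in the low microseconds, which is why
 * replacing the memcpy with DMA buys essentially nothing. */
static uint32_t time_frame_copy(uint8_t *dst, const uint8_t *src)
{
    uint32_t t0 = DWT->CYCCNT;
    memcpy(dst, src, 1514);
    return DWT->CYCCNT - t0;    /* elapsed CPU cycles */
}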
As regards bugs, the guy who did this project originally spent maybe a month (full-time equivalent) patching the code from bug reports found all over the internet - the original ST code was indeed crap - but that is historical.
Regarding the LWIP config, yes, very true, it is poorly documented - symptomatic of a project which started ~17 years ago (the devs have long since got themselves girlfriends and moved on) and nobody ever wrote a "how to get started quickly" guide for it. But the 17 years also work in its favour: the bugs should have been shaken out of it, over literally millions of devices in the field, and it is very unlikely that everybody is using the same 10% of the functionality, i.e. concealing major bugs.
That could well be true for MbedTLS, though. I spent ages googling for info on how to set up the buffer sizes etc., then testing and documenting it. It's actually easy if you throw a lot of RAM at it (say 50k), but I am not doing that. A google for lwipopts.h yields loads of hits and eventually you find some useful info... A crap situation, but typical of open source software! The alternative is closed commercial source, or low-deployment-count source where one doesn't really know how well it has been tested.
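For what it's worth, these are the lwipopts.h settings which, in my experience, dominate RAM use and TCP throughput. The values here are illustrative for a small-RAM target, not a recommendation - tune them against your own traffic:

/* lwipopts.h - the few options that dominate memory use and TCP speed.
 * Values are illustrative for a small-RAM target. */
#define MEM_SIZE            (12 * 1024)   /* heap for PBUF_RAM (TX) pbufs  */
#define PBUF_POOL_SIZE      8             /* RX pbuf pool entries          */
#define PBUF_POOL_BUFSIZE   1524          /* each holds one full frame     */
#define TCP_MSS             1460          /* standard ethernet MSS         */
#define TCP_WND             (4 * TCP_MSS) /* RX window - throughput lever  */
#define TCP_SND_BUF         (4 * TCP_MSS) /* TX buffering - the other one  */
#define TCP_SND_QUEUELEN    (4 * TCP_SND_BUF / TCP_MSS)
#define MEMP_NUM_TCP_SEG    16            /* queued TCP segments           */

The two window/buffer defines are the main throughput levers: too small and TCP stalls waiting for acks, larger and you trade RAM for speed.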