Author Topic: Anyone here familiar with LWIP?  (Read 17070 times)


Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #75 on: February 04, 2023, 09:56:29 pm »
The 2nd URL above says the RAW API is also thread-safe if core locking=1.

To avoid any potential confusion, readers should know that this statement is not true, per se.  That is, the raw (a.k.a. "low-level") APIs are not inherently thread-safe, in that their implementations do not contain code to handle contention between threads.  However, when the LwIP core locking feature is enabled (LWIP_TCPIP_CORE_LOCKING = 1), these APIs may be called from a thread other than the LwIP "tcpip" thread by acquiring the global core lock before calling the function.  When used this way, it is entirely up to the caller to ensure that the lock is acquired and released at the correct times.  (It is also important to know that the raw API is callback-based, and that all callbacks will happen on the tcpip thread, regardless of which thread made the original API call.)
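
For illustration, here is a minimal sketch of that pattern, assuming LWIP_TCPIP_CORE_LOCKING = 1 and lwIP 2.x (pcb/data/len are just placeholders, not code from your project):

Code: [Select]
#include "lwip/tcpip.h"   // LOCK_TCPIP_CORE / UNLOCK_TCPIP_CORE
#include "lwip/tcp.h"

// Call a raw-API function from an application task by bracketing it
// with the global core lock. Keep the locked region short.
err_t send_from_app_task(struct tcp_pcb *pcb, const void *data, u16_t len)
{
    err_t err;

    LOCK_TCPIP_CORE();                  // acquire the global core lock
    err = tcp_write(pcb, data, len, TCP_WRITE_FLAG_COPY);
    if (err == ERR_OK)
    {
        err = tcp_output(pcb);          // kick off transmission
    }
    UNLOCK_TCPIP_CORE();                // release before doing anything slow

    return err;
}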

Finally, a note about terminology from the LwIP docs:

Do not confuse the lwIP raw API with raw Ethernet or IP sockets.
The former is a way of interfacing the lwIP network stack (including
TCP and UDP), the latter refers to processing raw Ethernet or IP data
instead of TCP connections or UDP packets.
 
The following users thanked this post: peter-h

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #76 on: February 04, 2023, 11:02:33 pm »
Thanks :)

I was reading this up today and especially in light of calling pbuf_free() from an RTOS task which is not the TCP/IP task, here

Code: [Select]
    do
    {
        p = low_level_input( netif ); // This sets rxactive=true if it sees data

        if (p != NULL)
        {
            if (netif->input( p, netif ) != ERR_OK)
            {
                pbuf_free(p);
            }
        }

        if (rxactive)
        {
            rxactive = false;
            // Seen rx data - reload timeout
            xTimerReset(rxactive_timer, 20); // Reload "rx active" timeout (with ETH_SLOW_POLL_DELAY)
            // and get osDelay below to run fast
            rx_poll_period = ETH_RX_FAST_POLL_INTERVAL;
        }

        // This has a dramatic effect on ETH speed, both ways (TCP/IP acks all packets)
        osDelay(rx_poll_period);

        // Do ETH link status change check (code not shown for brevity)

    } while(true);

Now testing with

Code: [Select]
// Flag to make LWIP API thread-safe. The netconn and socket APIs are claimed
// to be thread-safe anyway. The raw API is never thread-safe.
// A huge amount of online discussion on this topic; most of it unclear, but
// ON (1) seems to be recommended, as being more efficient. However, it was
// found that OFF (0) greatly reduces weird data loss errors under heavy
// multiple TCP/IP RTOS task load.
#define LWIP_TCPIP_CORE_LOCKING    0

// If LWIP_TCPIP_CORE_LOCKING=0 then these two need to be 1
// See [url]https://www.nongnu.org/lwip/2_1_x/multithreading.html[/url]
#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1
#define SYS_LIGHTWEIGHT_PROT            1

Looking at the actual code on e.g. a socket write, when LWIP_TCPIP_CORE_LOCKING=1, the whole API call is locked and the granularity is pretty poor, which should still work but probably doesn't help...
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #77 on: February 08, 2023, 09:33:30 am »
Just had loads of fun and made some interesting progress.

There were two reasons for the multi second gaps:

- broadcast packets (2 buffers, one with a bcast and one with data, messes up LWIP) - fundamental reason not established but bcasts are now filtered, and there are 4 rx buffers

- RTOS task priority

On the 2nd one:

Notes from my ref manual:

Most projects can be written with all "application" tasks running at tskIDLE_PRIORITY, yielding to RTOS when waiting for something.

One exception to the above is where you have higher priority task(s) running which are not yielding, and you do not want execution gaps in the lower priority tasks. In the xxx this happens if TLS is used. TLS is a complex monolithic task which performs crypto operations lasting up to several seconds. It runs at the priority of whichever task called TLS, and typically this is network_task (LWIP itself, priority osPriorityNormal = 24). If these gaps would be an issue, your task priority needs to be the same as or higher than network_task (24). Where TLS is not used, the task priority can be all the way down to tskIDLE_PRIORITY, even where networking is used. Furthermore, due to the structure of LWIP, the networking application priority should not be higher than LWIP (24) since it might cause blocking on socket writes.

The above is because TLS is being called from the LWIP task, basically. Somebody else may have done it differently... maybe, with inter-task message passing, but why? I suppose there is an argument for running TLS at a very low priority, if you are happy with gaps in the comms / slow comms, but don't want TLS causing the multi second gaps.

The other approach is to rehack TLS so it yields periodically. Then you have a whole loooad of fun rehacking TLS every time you want to update it :) And ours has already been hacked around to process a string of certificates from a file without needing > 200k of RAM to hold the whole current cert chain.
« Last Edit: February 08, 2023, 10:41:10 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #78 on: February 09, 2023, 09:15:36 pm »
And here is another puzzle. Two sections of the ETH - LWIP low_level_* code. Each one contains two memcpy calls. In each case, the 1st call of the two is never taken...

low_level_input:

Code: [Select]

// Load the packet (if not rejected above) into LWIP's buffer
if (p != NULL)
{
    dmarxdesc = EthHandle.RxFrameInfos.FSRxDesc;
    bufferoffset = 0;

    for (q = p; q != NULL; q = q->next)
    {
        byteslefttocopy = q->len;
        payloadoffset = 0;

        /* Check if the length of bytes to copy in current pbuf is bigger than Rx buffer size */
        while( (byteslefttocopy + bufferoffset) > ETH_RX_BUF_SIZE )
        {
            /* Copy data to pbuf */
            memcpy( (uint8_t*)((uint8_t*)q->payload + payloadoffset), (uint8_t*)((uint8_t*)buffer + bufferoffset), (ETH_RX_BUF_SIZE - bufferoffset));

            /* Point to next descriptor */
            dmarxdesc = (ETH_DMADescTypeDef *)(dmarxdesc->Buffer2NextDescAddr);
            buffer = (uint8_t *)(dmarxdesc->Buffer1Addr);

            byteslefttocopy = byteslefttocopy - (ETH_RX_BUF_SIZE - bufferoffset);
            payloadoffset = payloadoffset + (ETH_RX_BUF_SIZE - bufferoffset);
            bufferoffset = 0;
        }

        /* Copy remaining data in pbuf */
        memcpy( (uint8_t*)((uint8_t*)q->payload + payloadoffset), (uint8_t*)((uint8_t*)buffer + bufferoffset), byteslefttocopy);
        bufferoffset = bufferoffset + byteslefttocopy;
    }
}


low_level_output:

Code: [Select]

/* Check if the length of data to copy is bigger than Tx buffer size */
while( (byteslefttocopy + bufferoffset) > ETH_TX_BUF_SIZE )
{
    // Copy data to Tx buffer - should use DMA but actually the perf diff is negligible
    memcpy_fast( (uint8_t*)((uint8_t*)buffer + bufferoffset), (uint8_t*)((uint8_t*)q->payload + payloadoffset), (ETH_TX_BUF_SIZE - bufferoffset) );

    /* Point to next descriptor */
    DmaTxDesc = (ETH_DMADescTypeDef *)(DmaTxDesc->Buffer2NextDescAddr);

    /* Check if the buffer is available */
    if ((DmaTxDesc->Status & ETH_DMATXDESC_OWN) != (uint32_t)RESET)
    {
        errval = ERR_USE;
        goto error;
    }

    buffer = (uint8_t *)(DmaTxDesc->Buffer1Addr);

    byteslefttocopy = byteslefttocopy - (ETH_TX_BUF_SIZE - bufferoffset);
    payloadoffset = payloadoffset + (ETH_TX_BUF_SIZE - bufferoffset);
    framelength = framelength + (ETH_TX_BUF_SIZE - bufferoffset);
    bufferoffset = 0;
}

/* Copy the remaining bytes */
memcpy( (uint8_t*)((uint8_t*)buffer + bufferoffset), (uint8_t*)((uint8_t*)q->payload + payloadoffset), byteslefttocopy );
bufferoffset = bufferoffset + byteslefttocopy;
framelength = framelength + byteslefttocopy;
}
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #79 on: February 09, 2023, 11:48:49 pm »
In each case, the 1st call of the two is never taken

This is because the packets you are sending/receiving are smaller than the MAC buffers you are using to hold them (as specified by the ETH_TX_BUF_SIZE and ETH_RX_BUF_SIZE values).

Indeed, based on what you posted earlier, this will always be the case, since ETH_TX_BUF_SIZE and ETH_RX_BUF_SIZE are both set to ETH_MAX_PACKET_SIZE.  So the while loop in both functions is unnecessary.

Relatedly, the for loop in low_level_input() is also unnecessary.  This is because the size of the buffers in the pbuf pool (as set by PBUF_POOL_BUFSIZE) is larger than the maximum Ethernet frame that can be received.  Therefore, when pbuf_alloc() is called to allocate a buffer of the desired size (len), it will only ever return a single buffer (i.e. a buffer where buf->next == NULL).
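
To illustrate (a sketch only, assuming PBUF_POOL_BUFSIZE >= the maximum frame size; len and buffer are the variables your existing driver code already uses):

Code: [Select]
struct pbuf *p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
if (p != NULL)
{
    // With PBUF_POOL_BUFSIZE >= max frame size the pool allocator returns an
    // unchained pbuf, so a single memcpy replaces the for/while loops.
    LWIP_ASSERT("expected a single pbuf", p->next == NULL && p->len == p->tot_len);
    memcpy(p->payload, buffer, len);
}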
 
The following users thanked this post: peter-h

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #80 on: February 10, 2023, 12:57:11 am »
The above is because TLS is being called from the LWIP task, basically.

If this is true, it means you're running mbedTLS in an extremely inefficient mode.   Typically, when using mbedTLS with LwIP in an RTOS context (i.e. NO_SYS==0), mbedTLS runs in the application task.  In this configuration, mbedTLS calls the LwIP sockets API to send and receive TLS "records" (a.k.a. messages) over the underlying transport connection.  In this way, all TLS computations, including crypto operations, are performed in the application task, and the lwip tcpip task is solely responsible for attending to the TCP state (e.g. sending/receiving ACKs, performing retransmissions, etc.).
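
To illustrate the arrangement (a rough sketch, not your project's code; proper mapping of errors to MBEDTLS_ERR_SSL_WANT_READ/WRITE is omitted):

Code: [Select]
#include "lwip/sockets.h"
#include "mbedtls/ssl.h"

// BIO callbacks: mbedTLS hands its TLS records to the LwIP sockets API, so all
// crypto work runs on the application task, not on the tcpip task.
static int tls_net_send(void *ctx, const unsigned char *buf, size_t len)
{
    return lwip_send(*(int *)ctx, buf, len, 0);
}

static int tls_net_recv(void *ctx, unsigned char *buf, size_t len)
{
    return lwip_recv(*(int *)ctx, buf, len, 0);
}

// ... after creating the socket 'fd' and the ssl context:
// mbedtls_ssl_set_bio(&ssl, &fd, tls_net_send, tls_net_recv, NULL);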

There is a configuration that allows the use of mbedTLS directly on top of the LwIP raw API.  This mode uses the altcp_tls API.  Because it's built on the raw API, the altcp_tls API can only be called from the tcpip task.  Unfortunately, this configuration is really only suitable for very simplistic applications, and even then it is sub-optimal.

As you've observed, the fundamental problem is that the crypto math takes a long time.  This is especially true on the STM32F4s, which have no acceleration for RSA or EC operations.  What this means is that, while the tcpip task is performing a crypto operation, it is unable to perform any of the other operations for which it is responsible.  As a consequence, if you have other application tasks which are using the sockets API, these tasks will be blocked waiting for the tcpip task to finish the crypto op and attend to the socket request.  Even in a system where there are no other application tasks, and all network activity happens in the tcpip task, running the crypto code in the tcpip task itself is a bad idea, because it means that other normal TCP activities like sending delayed ACKs or retransmissions will not happen in a timely manner, leading to poor throughput.  Yielding from inside the TLS code won't help with this problem, since the tcpip thread is trapped inside the crypto code and can't get back to its main loop to service other activities.
 
The following users thanked this post: peter-h

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #81 on: February 10, 2023, 07:31:43 am »
Quote
This is because the packets you are sending/receiving are smaller than the MAC buffers you are using to hold them (as specified by the ETH_TX_BUF_SIZE and ETH_RX_BUF_SIZE values).

I tried to create large packets (I am testing with various programs running e.g. an http server doing file downloads and uploads) and nothing triggers that condition, which makes me wonder if the condition

while( (byteslefttocopy + bufferoffset) > ETH_TX_BUF_SIZE )

can ever be met. Is there any point in having ETH_TX_BUF_SIZE smaller, anyway? It would impose a limit on the MTU, for a start.

I struggle to see where the size of the buffer is defined. For RX, it should be the (already negotiated) MTU, and ETH_RX_BUF_SIZE is already equal to the largest possible "standard" MTU. For TX, it is whatever LWIP sends to ETH which is probably going to be again the MTU configured in lwipopts.h.

So that code will get run only if somebody is running with small ETH packets.

I am using a large PBUF_POOL_BUFSIZE (1514) but still just under ETH_RX_BUF_SIZE.

Relative to this
https://www.eevblog.com/forum/microcontrollers/gcc-compiler-optimisation/msg4692329/#msg4692329
I am wondering if there is something I can do to improve the chance of 4-aligned memcpy transfers. The ETH_ buffers are aligned but I am not sure about the LWIP ones. Obviously a sequence of 1514 byte buffers won't be. I went to 1516.

The payloadoffset is not usually going to be 4-aligned (not aligned 75% of the time) but the buffer itself could be.

Quote
the for loop in low_level_input() is also unnecessary.  This is because the size of the buffers in the pbuf pool (as set by PBUF_POOL_BUFSIZE) is larger than the maximum Ethernet frame that can be received

Even more so with 1516 :)

Amazing. But I will leave the code in there...

An original version of the Cube code had PBUF_POOL_SIZE=8 and PBUF_POOL_BUFSIZE=512. I spent a lot of time playing with this and ETH performance and found the 8x512 to have no advantage. AIUI LWIP would just concatenate 3 of those for large packets. But even running with this setting for a bit, I found that the 1st memcpy is still never taken. Could be a timing thing.

There is a weird reason for using large pbufs: my simplistic http server is assuming that the entire "string of interest" (the GET... etc) is contained in one packet from LWIP. This works fine on a LAN, which is all that that server is for (local config). I need PBUF_POOL_BUFSIZE > ~500 to receive this kind of stuff in one go

Code: [Select]
* PUT /ufile=BOOT.TXT HTTP/1.1..Host: 192.168.3.73:81..User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64;
 * x64; rv:103.0) Gecko/20100101 Firefox/103.0..Accept: .....Accept-Language: en-US,en;q=0.5..Accept-En
 * coding: gzip, deflate..Content-Type: text/plain;charset=UTF-8..Content-Length: 198..Origin: http * p
 * ://192.168.3.73:81..Connection: keep-alive..Referer: http://192.168.3.73:81/efile?BOOT.TXT....boot
 * time: 2022-08-02 15:39:28.app name: appname_1.1
 *
 * The likelihood of the 1st packet containing the entire header including the CRLFCRLF
 * is dependent on the value PBUF_POOL_BUFSIZE on lwipopts.h, and on the program sending
 * the data (the browser).
 *

Quote
If this is true, it means you're running mbedTLS in an extremely inefficient mode

I think I was wrong. The application calls the MbedTLS API which in turn calls LWIP.

Quote
Yielding from inside the TLS code wont help with this problem, since the tcpip thread is trapped inside the crypto code and can't get back to its main loop to service other activities.

What I observe, by waggling GPIO pins and watching on a scope which tasks are running, is that a task whose priority is the same as LWIP's (or higher - I am not sure that is a good idea for other reasons, so I am staying with "same") does not get blocked by the RSA/EC delay, even if that task is using LWIP. But those tasks are merely using LWIP, not using TLS.

The tasks I have (2 of them) which are using TLS do get blocked and must obviously get blocked because RSA/EC is the session setup, before data can be sent. No way around that (other than a coprocessor).

The problem with blocking a non-TLS "internet task" for 3 secs is that it makes things time out all over the place. This appears to be avoided by matching its priority to LWIP's.

However this doesn't appear to work for LWIP applications which use the Netconn API. They get hung up. Socket API applications seem to run OK through the gaps produced by TLS crypto. That seems strange because the socket API uses the netconn API; this is clear when one steps through it.

« Last Edit: February 10, 2023, 04:53:01 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #82 on: February 10, 2023, 05:36:52 pm »

I tried to create large packets (I am testing with various programs running e.g. an http server doing file downloads and uploads) and nothing triggers that condition, which makes me wonder if the condition

while( (byteslefttocopy + bufferoffset) > ETH_TX_BUF_SIZE )

can ever be met.

In your case I believe this expression will never be true.  Because LwIP has been configured with an Ethernet MTU of 1500 (payload) bytes, the largest frame that LwIP will attempt to send (and therefore the largest value for byteslefttocopy) is 1518 bytes.  The ST ETH driver sets ETH_TX_BUF_SIZE to ETH_MAX_PACKET_SIZE, which in turn is set to 1524 bytes (see stm32f4x7_eth.h).  (Technically, the Ethernet max frame size is 1522; the driver code includes 2 "extra" bytes for reasons I haven't investigated).  Thus the expression will never be true.

Quote
Is there any point in having ETH_TX_BUF_SIZE smaller, anyway? It would impose a limit on the MTU, for a start.

Having a smaller buffer size would not limit the MTU, because the MAC allows Ethernet frames to span multiple buffers.  Indeed, that's what the while loop is for--handling the case where the frame needs to be copied into more than one TX buffer.  The benefit of smaller TX buffers is potentially more efficient memory usage in some cases.  Many TX packets will be smaller than the Ethernet MTU (e.g. TCP ACKs).  Having your total TX buffer space split up into smaller individual buffers allows more packets to be on the TX queue simultaneously.  This might allow you to get away with less overall TX buffer memory in the case where you have many simultaneous TCP connections all vying for a spot on the TX queue.  That said, if one is really concerned about reducing overall buffer memory size, the optimal answer is to do DMA directly into/out of the LwIP buffers themselves.

Quote
I struggle to see where the size of the buffer is defined. For RX, it should be the (already negotiated) MTU, and ETH_RX_BUF_SIZE is already equal to the largest possible "standard" MTU. For TX, it is whatever LWIP sends to ETH which is probably going to be again the MTU configured in lwipopts.h.

As you state, for inbound packets the max size will be the standard max Ethernet frame size (1522), possibly reduced by the negotiated TCP MSS.  For outbound packets, the max size will be limited by the smaller of: the mtu value set on the netif, the value set for TCP_MSS in lwipopts.h (for TCP only), and the negotiated TCP MSS (again, for TCP only).

Quote
The ETH_ buffers are aligned but I am not sure about the LWIP ones. Obviously a sequence of 1514 byte buffers won't be. I went to 1516.

All the buffers in the LwIP pbuf pool are automatically aligned to whatever MEM_ALIGNMENT is set to.  (This is thanks to the convoluted preprocessor logic in memp.h, memp_std.h and memp.c).  This means on the RX side the memory copy from the ETH RX buffer to the LwIP pbuf will always be aligned.

The TX side is more complicated.  Depending on how your application sends its data, the outbound Ethernet frame as seen by low_level_output() may in fact be split up over multiple pbufs (thus requiring multiple iterations of the for loop to copy).  This should only happen in cases where the application chooses the no-copy option when sending data.  In those cases, one of the pbufs in the chain will point directly to the application's data buffer.  Thus the alignment of this data will depend on whether the application's buffer was aligned to begin with. 

In most cases the data in pbufs will be aligned.  So it's best to use a copy function that is optimal for aligned data, but can cope with copying unaligned data.
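
For example, something along these lines (a sketch, not your memcpy_fast()):

Code: [Select]
#include <stdint.h>
#include <stddef.h>

// Word-at-a-time copy when both pointers are 4-aligned; plain byte copy for
// the unaligned case and for the tail.
void copy_fast(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0)
    {
        while (n >= 4)
        {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4; s += 4; n -= 4;
        }
    }
    while (n--)
    {
        *d++ = *s++;
    }
}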

Quote
There is a weird reason for using large pbufs: my simplistic http server is assuming that the entire "string of interest" (the GET... etc) is contained in one packet from LWIP. This works fine on a LAN, which is all that that server is for (local config). I need PBUF_POOL_BUFSIZE > ~500 to receive this kind of stuff in one go

Be careful, because this depends on how the sending system chooses to write the HTTP header data.  There's nothing in the HTTP spec that requires the sender to transmit the head of an HTTP request in a single TCP segment, even if the TCP MSS is big enough to support this.   The only safe approach is to accumulate the received data into a buffer until the end-of-head marker (\r\n\r\n) is seen.
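
A rough sketch of that accumulation, using the netconn API (buffer name and size are made up; error handling trimmed):

Code: [Select]
#include <string.h>
#include "lwip/api.h"

// Accumulate received data until the end-of-header marker arrives or the
// buffer fills up.
static char reqbuf[1024];
size_t used = 0;
struct netbuf *nb;

while (used < sizeof(reqbuf) - 1 &&
       strstr(reqbuf, "\r\n\r\n") == NULL &&
       netconn_recv(newconn, &nb) == ERR_OK)
{
    used += netbuf_copy_partial(nb, reqbuf + used, sizeof(reqbuf) - 1 - used, 0);
    reqbuf[used] = '\0';
    netbuf_delete(nb);
}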

Quote
The problem with blocking a non-TLS "internet task" for 3 secs is that it makes things time out all over the place. This appears to be avoided by matching its priority to LWIP's.

For good comms performance, typically you want your tcpip task to be higher priority than your application tasks, but lower priority than your timing-critical tasks (e.g. a task delivering audio to a DAC).  The tcpip task should never block on anything other than its input work queue (and possibly the tcpip global lock, if core locking is enabled).  The only CPU intensive work in the tcpip task is data copying, which should be limited by overall network bandwidth.

Quote
However this doesn't appear to work for LWIP applications which use the Netconn API. They get hung up. Socket API applications seem to run OK through TLS crypto.

Internally, the sockets API is just a veneer over the netconn API.  So something else is going on here.
 
The following users thanked this post: peter-h

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #83 on: February 10, 2023, 06:05:02 pm »
Relative to this
https://www.eevblog.com/forum/microcontrollers/gcc-compiler-optimisation/msg4692329/#msg4692329
I am wondering if there is something I can do to improve the chance of 4-aligned memcpy transfers.

One more point on this: if useful, you can override LwIP's LWIP_DECLARE_MEMORY_ALIGNED macro to include explicit compiler directives, including alignment directives [__attribute__((aligned(4)))] and/or section directives [__attribute__ ((section ("LWIPBUFFERS")))].
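
For example, in lwipopts.h (assuming GCC; the section name is made up and must also exist in your linker script):

Code: [Select]
// The default in mem.h is just: u8_t variable_name[LWIP_MEM_ALIGN_BUFFER(size)]
#define LWIP_DECLARE_MEMORY_ALIGNED(variable_name, size) \
    u8_t variable_name[LWIP_MEM_ALIGN_BUFFER(size)] \
        __attribute__((aligned(4))) \
        __attribute__((section("LWIPBUFFERS")))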
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #84 on: February 10, 2023, 08:49:26 pm »
Quote
Be careful, because this depends on how the sending system chooses to write the HTTP header data.  There's nothing in the HTTP spec that requires the sender to transmit the head of an HTTP request in a single TCP segment, even if the TCP MSS is big enough to support this.   The only safe approach is to accumulate the received data into a buffer until the end-of-head marker (\r\n\r\n) is seen.

Yes, I know. It was a conscious hack. I have now done a small mod, in the form of a 200ms delay:

Code: [Select]
/* Create a new TCP connection handle */
conn = netconn_new(NETCONN_TCP);

if (conn != NULL)
{
    /* Bind to port 80 (unless modified in config.ini) with default IP address */
    err = netconn_bind(conn, NULL, http_server_port);

    if (err == ERR_OK)
    {
        // Put the connection into LISTEN state
        netconn_listen(conn);

        while(true)
        {
            // Accept the connection after netconn_listen
            // This function reports if there is some rx data (non blocking)
            err_t res = netconn_accept(conn, &newconn);

            if (res == ERR_OK)
            {
                // Something arrived from the client
                debug_thread("http incoming connection");

                // Wait a bit for a complete request to arrive
                // This is a hack; ideally we should wait for the end of each 'body'.
                osDelay(200);

                // Respond to the client request
                http_server_serve(newconn);

                // Delete the connection
                netconn_delete(newconn);
            }
        }
    }
    else
    {
        debug_thread("cannot bind netconn");
    }
}

A better way would be to somehow get how much data has arrived and wait until that figure has stopped rising, for say 200ms.
« Last Edit: February 10, 2023, 08:53:12 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #85 on: February 12, 2023, 11:30:27 am »
Just spent a couple of days playing with task priorities...

I have a couple of tasks which use LWIP and which get suspended for a few seconds during TLS RSA/EC crypto. One (a primitive http server) uses Netconn and the other (a serial to TCP data copy process) uses sockets.

I also have a number of tasks which don't do any networking and which run as they should, throughout. One of these copies a RAM buffer to some 7 segment LEDs, via an SPI LED controller. Experimenting to find what priority these need is difficult, but it looks like they need to be at/above the tasks which invoke TLS. If their priority is 0 then TLS hangs them up as well.

After much experimentation with RTOS priorities, this is what I found, and I wonder if it is right:

TCP/IP applications (whether using the LWIP Netconn API or the LWIP socket API) should run at a priority equal to or lower than that of LWIP itself which [in this project] is osPriorityHigh (40). TCP/IP applications can be run with a priority all the way down to tskIDLE_PRIORITY (0).

The exception is if TLS is in use. TLS does not yield to the RTOS; you get a solid CPU time lump of ~3 secs (STM 32F417, hardware AES only). TLS starts in the priority of the task which invokes it, but subsequent TLS-driven TCP/IP operations run at the priority of LWIP. So when TLS is doing the session setup crypto, tasks of a priority lower than LWIP get suspended. If this gap is an issue, the priority of the relevant tasks should be equal to LWIP's. Furthermore, due to the structure of LWIP, the priority of a task using it should not be higher than LWIP's since it might cause blocking on socket (or netconn) writes.

Does this make sense?

It looks like LWIP blocks all netconn and socket ops when TLS is using it. Is that possible?

I am running with

#define LWIP_TCPIP_CORE_LOCKING      0
#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1
#define SYS_LIGHTWEIGHT_PROT            1

as this was found to have much better granularity for task switching. With LWIP_TCPIP_CORE_LOCKING=1 you end up with a crude mutex across the entire API call, which is fine in a single RTOS task.

Thank you in advance for any pointers.
« Last Edit: February 12, 2023, 01:55:26 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #86 on: February 12, 2023, 04:49:37 pm »
The exception is if TLS is in use. TLS does not yield to the RTOS; you get a solid CPU time lump of ~3 secs (STM 32F417, hardware AES only).

When you say a "solid lump" what exactly do you mean?  Do you mean literally nothing else of equal priority gets to run?  This shouldn't be happening.   To wit: if you have two tasks running at the same priority, one that uses TLS and one that doesn't, and both tasks are purely computation bound (i.e. no blocking on synchronization objects), then you should see time-slicing between the two tasks at the interval of the system tick.  If you're not seeing this, it means: 1) your other tasks are actually running at a lower priority than the TLS task, 2) the TLS task is holding a lock which it shouldn't, or 3) preemptive tasking has been disabled in your RTOS.
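
For reference, the FreeRTOS options behind point 3 (the values shown are what's needed for same-priority round-robin; check your own FreeRTOSConfig.h):

Code: [Select]
#define configUSE_PREEMPTION      1   // equal/higher-priority ready tasks pre-empt at the tick
#define configUSE_TIME_SLICING    1   // round-robin among tasks of equal priority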

Quote
TLS starts in the priority of the task which invokes it, but subsequent TLS-driven TCP/IP operations run at the priority of LWIP.

I don't see how this is possible.  Given the way the LwIP socket and netconn APIs work**, all TLS computation should be happening solely in the application thread.  Furthermore, when that computation occurs, the application thread should not be holding any synchronization object that the LwIP tcpip thread needs.  Therefore, you should never have a situation where priority is inherited from the tcpip task to the application task (i.e. where the OS boosts the priority of the app task because it happens to be holding a mutex that the tcpip task needs).

** To expand on this a little further: the LwIP socket and netconn APIs (but not the raw API) are largely "down-call" APIs, meaning that interactions across the API happen by the API consumer calling a function, and LwIP performing some desired action and/or returning some data by the time that function returns.  For the most part, there are no "up-calls" in these APIs, where LwIP calls back into the application's code.

The only exception to this model is the optional netconn "callback" feature, where the LwIP task will call back into the application's code when a state change occurs on the connection (data available to read, space available to write, etc.).  Although the LwIP docs do not make this clear at all, the application code should not do any work in these callbacks.  Rather, it should wake the application task, which would then take action to handle the state change. 

So, if you happen to be using this callback feature (e.g. if you are calling netconn_new_with_callback()) and you are doing work inside the callback function, this could be a source of your problems.
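
If you do use the callback feature, the intended pattern is roughly this (a sketch; FreeRTOS task notifications assumed, app_task_handle is hypothetical):

Code: [Select]
#include "lwip/api.h"
#include "FreeRTOS.h"
#include "task.h"

static TaskHandle_t app_task_handle;

// Runs on the tcpip task: do no work here, just wake the application task.
static void conn_event_cb(struct netconn *conn, enum netconn_evt evt, u16_t len)
{
    (void)conn; (void)evt; (void)len;
    if (app_task_handle != NULL)
    {
        xTaskNotifyGive(app_task_handle);
    }
}

// In the application task:
//   conn = netconn_new_with_callback(NETCONN_TCP, conn_event_cb);
//   ...
//   ulTaskNotifyTake(pdTRUE, portMAX_DELAY);  // then handle the state change here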

Quote
as this was found to have much better granularity for task switching. With LWIP_TCPIP_CORE_LOCKING=1 you end up with a crude mutex across the entire API call, which is fine in a single RTOS task.

Frankly, in your case, I think something more subtle is going on.   All LwIP API calls end up being serialized, regardless of whether core locking is enabled.  It's just that, with core locking, the overhead of that serialization is much less (a single mutex acquisition/release vs two task switches minimum).  So enabling core locking should never make things worse unless the system is misconfigured in some way.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #87 on: February 12, 2023, 05:27:44 pm »
Quote
Do you mean literally nothing else of equal priority gets to run?  This shouldn't be happening

Tasks of equal or higher priority do run through the TLS ops - unless they call LWIP. Then they get "hung up" for the ~3 seconds.

I've been trying to trace this by waggling a GPIO pin and watching it on a scope, and it appears that the task itself does run as it should (e.g. its main loop still runs at say 100Hz if there is an osDelay(10) at the bottom of the loop) but calls to the socket API appear to be ineffective.

I have not yet narrowed it down to whether lwip_read returns no data or the result of lwip_write disappears inside LWIP. But no data is actually lost.

Quote
Given the way the LwIP socket and netconn APIs work**, all TLS computation should be happening solely in the application thread

That is what I understand, but what happens to priority when TLS makes a lwip_write or lwip_read API call? Is that LWIP code running in the calling task's priority?

Quote
The only exception to this model is the optional netconn "callback" feature,

AFAIK this is not used.

Quote
All LwIP API calls end up being serialized, regardless of whether core locking is enabled.  It's just that, with core locking, the overhead of that serialization is much less (a single mutex acquisition/release vs two task switches minimum).  So enabling core locking should never make things worse unless the system is misconfigured in some way.

It may well be.

It could be that with core locking ON one needs more buffering around somewhere. I've spent many weeks on this and during that time have tried stuff like a huge increase in buffers (which the finished product could not run with) and it made no difference.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #88 on: February 12, 2023, 06:51:45 pm »
That is what I understand, but what happens to priority when TLS makes a lwip_write or lwip_read API call? Is that LWIP code running in the calling task's priority?

lwip_write() ultimately calls netconn_write_partly() (LwIP sockets being a veneer on the netconn API).  With core locking off, the netconn code creates an "API message", which captures the arguments to the write call.  The API message includes a pointer to the function that actually does the work of writing data to the connection, and a semaphore that is used to signal completion of the operation.  The API message is then appended to the work queue of the tcpip task, after which the application task waits on the completion semaphore.  The tcpip task dequeues the API message, calls the associated work function, and sets the semaphore, which in turn causes the application task to wake and ultimately return from the lwip_write() function.

The same process happens for any other task that is using either the sockets or netconn APIs.  Thus all LwIP API calls are serialized by means of the tcpip task work queue.  Also, because the synchronization object is a semaphore, at no point is there an inheritance of priority between application task(s) and the lwip tcpip task (and certainly no point at which the tcpip task's priority would be demoted to that of the app task).

Given this, there's no obvious way that an app task running TLS crypto code would block or otherwise hinder the tcpip task in servicing requests from other tasks.

But let me try another line of reasoning: how much outbound traffic is being generated by the TLS thread(s)?  If I recall correctly, you are running with MEM_USE_POOLS = 0, MEM_LIBC_MALLOC = 0, MEMP_MEM_MALLOC = 1 and MEM_SIZE = 6KB.  If a TLS thread is writing a lot of data, it could be dominating that 6K memory pool, which will cause other threads to have to wait when writing data to their connections.  (Also remember that, if running as a TLS server, there is a point during the handshake where the server certificates are sent in a big chunk to the client.  This can be a lot of data depending on how big your certificates are).

Quote
I've spent many weeks on this and during that time have tried stuff like a huge increase in buffers

Was this an increase in PBUF_POOL_SIZE or MEM_SIZE?
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #89 on: February 12, 2023, 09:36:08 pm »
In the absence of clear experimental data I have gone back to core locking 1 (and undefined these two):
#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1
#define SYS_LIGHTWEIGHT_PROT            1

I see a 2x speedup on file downloads in the http server, despite it running in idle priority (0). Previously it would need a priority of 1 (above various other tasks) to achieve this. So yes core locking ON does seem to improve performance.

Quote
lwip_write() ultimately calls netconn_write_partly() (LwIP sockets being a veneer on the netconn API).  With core locking off, the netconn code creates an "API message", which captures the arguments to the write call.  The API message includes a pointer to the function that actually does the work of writing data to the connection, and a semaphore that is used to signal completion of the operation.  The API message is then appended to the work queue of the tcpip task, after which the application task waits on the completion semaphore.  The tcpip task dequeues the API message, calls the associated work function, and sets the semaphore, which in turn causes the application task to wake and ultimately return from the lwip_write() function.

Is that still the case with core locking ON?

Quote
Was this an increase in PBUF_POOL_SIZE or MEM_SIZE?

Hmm, it was the latter, not the former :)

With
#define PBUF_POOL_BUFSIZE  1500 + PBUF_LINK_ENCAPSULATION_HLEN + PBUF_LINK_HLEN + 2
increasing PBUF_POOL_SIZE from 4 to 5 of the above does amazingly remove the "TLS hanging" problem 99% of the time, so I went to 6.

At a cost of ~3k of RAM. This is OK.

I realise this is application(s) dependent but I am already running a pretty well fully loaded system.

Thank you for this insight!

« Last Edit: February 13, 2023, 11:46:44 am by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #90 on: February 13, 2023, 09:25:03 pm »
This is how I understand this works

The RTOS is pre-emptive, with a 1ms tick. Even if a task runs in a loop without yielding to the RTOS, it will still get pre-empted by tasks whose priority is the same or higher.
When an application makes an LWIP API call (Netconn or Sockets) the code initially runs at the priority of the calling task (whose priority can be anything from tskIDLE_PRIORITY (0) to LWIP's core priority which is osPriorityHigh (40)). LWIP then passes control to its core. The LWIP API is mutex protected so that only one calling task at a time can be active inside the API. This does not compromise multitasking performance because LWIP internally serialises the various requests anyway.
Normally this scheme works well because LWIP runs fast, and the ETH 10/100 interface runs fast, but if a large packet - bigger than about 4k - is being sent out, LWIP can run out of buffers while waiting for something, and this will hold up other tasks whose priority is less than LWIP. This can happen regardless of whether the other tasks use the LWIP API.
For tasks which use LWIP, and since these must not have a priority above LWIP, the only solution is to move smaller data blocks. However, if TLS is not used, this is rarely an issue.
For tasks which do not use LWIP, a priority same as or greater than LWIP will enable them to run without gaps.
The only time this has been observed as a problem is when TLS is in use. TLS performs crypto operations which for session setup can run for several seconds (RSA/EC crypto), and it transmits a lot of data. The result is that there is a block on the execution of tasks with a priority below LWIP's, for this duration.


It could well be that nearly all weird issues are due to not enough buffers in various places, but I sure as hell cannot work out why the netconn-based http server gets hung up by the ~3 sec TLS activity. I've tried loads more buffers and it makes no difference.

One thing I'd like to know is the relationship between MEM_SIZE and PBUF_POOL_SIZE and PBUF_POOL_BUFSIZE. This is a comment from my experiments from a year or so ago

Code: [Select]
// MEM_SIZE: the size of the heap memory. This is a statically allocated block. You can find it
// in the .map file as the symbol ram_heap and you can see how much of this RAM gets used.
// If MEMP_MEM_MALLOC=0, this holds just the PBUF_ stuff.
// Empirically this needs to be big enough for at least 4 x PBUF_POOL_BUFSIZE.
// 6k yields a good speed and going to 8k+ makes a minimal improvement. The main
// factor affecting speed in the KDE is the poll period in ethernetif_input().
// This value also limits the biggest block size sent out by netconn_write. With the
// MEM_SIZE of 6k, the biggest block netconn_write will send out is 4k.

I have 6k, 6 and MTU for the three, respectively, but examining the MEM_SIZE area (in the .map file as ram_heap) I see only 3k of it having been used, after some hours.

As usual a lot of stuff online, mostly unclear.
« Last Edit: February 13, 2023, 10:41:20 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #91 on: February 14, 2023, 05:00:27 am »
This is how I understand this works [...]

Close, but not quite.

First let me answer your earlier question about how LwIP works when core locking is enabled.  Taking a similar scenario as before (an application task calling lwip_write()) the lwip code follows a nearly identical path as without core locking.  Specifically, lwip_write() ultimately calls netconn_write_vectors_partly() which creates an API message structure that includes the data to be written, a work function pointer and a completion semaphore.  Then the code calls tcpip_send_msg_wait_sem(), just as it did in the non-core locking case.  However, at this point, instead of queuing the API message in the tcpip task's work queue (as it did in the non-core locking case), tcpip_send_msg_wait_sem() acquires the global "core lock" mutex and directly calls the work function itself (again, all on the app task).  By acquiring the core lock, the code ensures that no other task, including the LwIP tcpip task itself, could be touching any of the shared data structures that hold the network state.

The work function (lwip_netconn_do_write()) behaves pretty much identically in both the core locking and non-core locking modes.  Its job is to take the application's data, append it onto the TCP send queue (possibly copying it in the process) and, if the time is right, initiate the sending of a TCP packet containing some or all of the data.  When a packet is to be sent, the application's data is prepended with the appropriate TCP, IP and Ethernet headers, and the final Ethernet packet passed to the netif output function.  Again, when core locking is enabled, all this happens on the application's task.  The job of the netif output function is to queue the Ethernet packet for transmission by the MAC.  For proper behavior, its design should be such that it never blocks.  Thus, in the "sunny day" scenario (i.e. where there are plenty of buffers available) the only time the application thread will block is when it is waiting for the core lock.  And since the LwIP code is designed to never block when the core lock is held, the amount of time an application task will spend waiting for the core lock should always be short.

Okay, so then what happens in the non-sunny day case when you run out of buffers?  In this situation, the work function discovers that there aren't enough buffers to hold all the application's data.  If the socket/netconn happens to be in non-blocking mode, then the function just stops where it is and returns an error back to the application telling it to try again.  But if the netconn is in blocking mode, then by definition the API function cannot return until all the data has been consumed (or a fatal connection error has occurred).  In this case, the work function (still running on the app task) places the netconn in a mode that says there is more application data waiting to be added to the TCP send queue.  It then releases the core lock and waits on the completion semaphore.  With the core lock released, the LwIP tcpip task is now free to do its thing, one of which is to wake up periodically and check whether more buffers are available.  When there are buffers available it proceeds to fill them with the application's data.  Once all the data has been copied, the tcpip task sets the completion semaphore, which wakes up the application task, and causes it to return from the API write function.

OK, with all that said, let me attempt to rewrite your summary of the way things work:

FreeRTOS is pre-emptive, with a 1ms tick. Even if a task runs in a loop without yielding, it will still get pre-empted by other tasks whose priority is the same or higher. 

When an application makes an LwIP API call (Netconn or Sockets) the code initially runs on the calling application's task, at its set priority.  [For core locking mode only] At some point during the API call the code will attempt to acquire the LwIP core lock mutex, and if necessary, will block waiting for another task to release it.  However, since the LwIP code is designed to never block while holding the core lock, the wait time will be small, and in many cases 0.  Once the core lock is acquired, the LwIP code running on the app task completes the requested API operation, releases the core lock and returns to the application code.  In this way, the LwIP core lock serves to serialize all activities in the LwIP stack, so as to protect shared data structures.

Normally this works well.  However, if the system is busy, LwIP can run out of buffers.  When this happens, the application task will block in the API call waiting for more buffers to be available.  The LwIP tcpip task will then periodically check the state of the buffer pool, and once buffers are available, it will complete the outstanding API work and wake the application task.

When buffers are low, other tasks calling the LwIP API concurrently will not block unless they too need buffers to complete their work.  Compute-bound tasks which are not calling LwIP (e.g. those performing heavy crypto) will not affect activity in LwIP regardless of buffer availability unless the priority of the compute-bound tasks is higher than that of the LwIP app tasks, or the LwIP tcpip task itself.  For best communications performance, the LwIP tcpip task should run at a high priority, the LwIP application tasks should run at a medium priority and compute-bound tasks should run at a lower priority.

Unfortunately, because most LwIP applications need buffers to do their work, running out of buffers can bring everything to a near stand-still.  Generally, the only solution is either to increase the size of the buffer pool or throttle the overall communication rate (send, receive or both) of the LwIP application tasks.


Quote
One thing I'd like to know is the relationship between MEM_SIZE and PBUF_POOL_SIZE and PBUF_POOL_BUFSIZE.

Given your configuration, I think the simple answer is that MEM_SIZE governs the number of buffers available for outgoing data (i.e. data your app send over the network) and PBUF_POOL_SIZE governs the buffers available for incoming data (i.e. packets you receive from the network).  Setting MEM_SIZE too small will induce the kind of buffer starvation/write blocking behavior described above.  Setting PBUF_POOL_SIZE  too small will result in inbound packets being dropped in the low-level input function.

 
The following users thanked this post: peter-h

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #92 on: February 14, 2023, 05:24:43 am »
Regarding the particular behavior you're seeing, I see that you have set TCP_SND_BUF to 4 * TCP_MSS.  That works out to 4 * (1500 - 40) = 5840.  TCP_SND_BUF represents the maximum amount of application data that can be written to a particular TCP send queue.  That number is awfully close to your MEM_SIZE value.  What this means is that a single TCP connection can dominate the buffers available in your memory heap if it happens to be writing faster than other tasks.

Perhaps this is what is happening with your TLS task.  TLS sends a huge chunk of data (the server certificates) to the peer during the handshake.  Because your window size is 2*TCP_MSS, getting all that data to the peer likely requires multiple round trips.  Furthermore, LwIP + FreeRTOS isn't necessarily fair about doling out buffers when they become available.  So it may be that your TLS task is getting more than its fair share.

Try setting TCP_SND_BUF to 2*TCP_MSS and see if that helps.  I would also make sure that all app tasks that use LwIP are running at the same priority.  I might even go so far as to set your TLS task to lower priority than your other tasks.
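
i.e. in lwipopts.h, something like (the TCP_MSS value shown is the one you appear to be using):

Code: [Select]
#define TCP_MSS        (1500 - 40)      // Ethernet MTU minus IP + TCP headers
#define TCP_SND_BUF    (2 * TCP_MSS)    // limit each connection's send queue
#define TCP_WND        (2 * TCP_MSS)    // receive window (as already configured)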
 
The following users thanked this post: peter-h

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #93 on: February 15, 2023, 07:46:08 am »
I cannot thank you too much, dare, for these amazing and illuminating posts.

BTW I did more tests, with a) LWIP_TCPIP_CORE_LOCKING ON and b) LWIP_TCPIP_CORE_LOCKING OFF (and with LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT and SYS_LIGHTWEIGHT_PROT  ON as apparently required if calling the API from multiple tasks) and the only difference I can reproduce is that the former delivers better file transfer performance for a given fairly small number of PBUFs. But this is transfer from LWIP so presumably this increase is related to being able to buffer more ACKs. So I have gone back to LWIP_TCPIP_CORE_LOCKING ON. As you said it should not make much difference because LWIP internally serialises the processing of the messages anyway (it runs only one RTOS task).

Quote
Given your configuration, I think the simple answer is that MEM_SIZE governs the number of buffers available for outgoing data (i.e. data your app send over the network) and PBUF_POOL_SIZE governs the buffers available for incoming data (i.e. packets you receive from the network).  Setting MEM_SIZE too small will induce the kind of buffer starvation/write blocking behavior described above.  Setting PBUF_POOL_SIZE  too small will result in inbound packets being dropped in the low-level input function.

This is so poorly documented in the various online sources. I did find, empirically, that MEM_SIZE directly limits the biggest packet which can be sent in one go. What happens if you try to send a bigger one I didn't investigate because I've been using

Code: [Select]
netconn_write(conn, PAGE_HEADER_ALL, strlen((char*)PAGE_HEADER_ALL), NETCONN_COPY);

and not checking the return code. In the context there was little one could do with it. A MEM_SIZE=6k allowed about 4k to be sent out. So MEM_SIZE is really about transmit.
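
For anybody digging through this later, checking it would look something like this (a sketch; debug_thread() is our own logging function):

Code: [Select]
err_t werr = netconn_write(conn, PAGE_HEADER_ALL, strlen((char*)PAGE_HEADER_ALL), NETCONN_COPY);
if (werr != ERR_OK)
{
    // ERR_MEM is the likely failure when MEM_SIZE is too small for the block
    debug_thread("netconn_write failed");
}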

Various sources suggest the PBUFs are used for rx only, while others suggest they are used for both rx and tx. I could not see the point of using this "list of buffers" system for tx because tx can be done by taking the caller's packet and repackaging it (presumably using some buffer temporarily allocated in MEM_SIZE) into one or more packets which are sent down to the PHY layer. In transmit mode, LWIP has total control. OTOH TCP packets are ACKed so presumably the PBUFs are used for these too and there would lie another bottleneck; that may be a reason for having a larger number of smaller PBUFs (the original ST port of LWIP used 512 byte PBUFs).

Quote
I see that you have set TCP_SND_BUF to 4 * TCP_MSS

It was already set to 2xTCP_MSS following your earlier suggestion.

Quote
Perhaps this is what is happening with your TLS task.  TLS sends a huge chunk of data (the server certificates) to the peer during the handshake

I asked the guy who did the TLS integration on this product. It was empirically determined that TLS sends out about 1k during the setup. For the size of subsequent data packets, one can negotiate the size, and we have set this to 4k. So yeah one comes back to your 4k. This relates to TLS needing two 16k buffers by definition (16k rx and 16k tx) but if you control one or both ends then these can be reduced. With an HTTPS client the rx buffer must still be 16k but the tx buffer can be smaller because you control what is sent out, and thus save 12k of RAM.

For background: Currently TLS is working with a statically allocated block of 48k in which it runs its private heap; this seems sufficient for everything that's been tested, but unfortunately this is not (and cannot be) deterministic. It depends on what crypto suites get negotiated. AES was using quite a lot of RAM (the TLS implementation was actually pretty fast, at 800kbytes/sec) and moving it to the 32F417's hardware AES saved 12k. But other savings are much smaller, and the 417 doesn't do RSA/EC, hence all this hassle with the 3 sec hangs.

We have not tested DES/3DES, which could be moved to 417 hardware, but I've not been able to find out if anybody out there might still be using it, and there is a proposal for new TLS to drop DES. I think this is dangerous because e.g. it was found that some root certificates (some of these are old, e.g. 2005) are signed with a "supposedly deprecated" hash, and if you tried to be clever and removed support for obsolete hashes because you read on the internet that they should not be used, you would have a problem! So the guy ported our TLS code to a win32 executable which can be run on a connection which isn't working and will tell you directly why not.

The problem with doing crypto in hardware is that it prevents one using one of the Chinese copies of the 417 which are just a 407 (no crypto).

I've tried increasing MEM_SIZE to silly values and it doesn't fix the hanging issue.

I've tried the priority adjustment and it makes no difference. Currently all "internet" tasks run at idle (0).

The main task which was getting hung up by TLS is now running OK due to pool size = 6 (was 4)

Code: [Select]
/* ---------- Pbuf dynamically allocated buffer blocks  ----------
*
* These settings make little difference to performance, which is dominated by
* the low_level_input poll period. These PBUFs relate directly to the netconn API netbufs.
*
* PBUF_POOL_SIZE: the number of buffers in the pbuf pool.
* Statically allocated, so each increment frees up 1.5k of RAM. Any value under 4 slows
* down the data rate a lot.
* 12/2/23 PH 4 -> 6 fixes TLS blocking of various RTOS tasks (5 almost does).
* The most this can go to with TLS still having enough free RAM for its 48k block is ~9.
*/

#define PBUF_POOL_SIZE           6

/* PBUF_POOL_BUFSIZE: the size of each pbuf in the pbuf pool.
* The +2 is to get a multiple of 4 bytes so the PBUFs are 4-aligned (for memcpy_fast())
* Probably the +2 is not needed because the PBUFs are 4-aligned anyway in the LWIP code.
*/

#define PBUF_POOL_BUFSIZE  1500 + PBUF_LINK_ENCAPSULATION_HLEN + PBUF_LINK_HLEN + 2


The remaining task that still is getting hung up by TLS is an http server which uses the netconn API and nothing I do seems to fix that, not even a priority of 40 (same as LWIP) or a huge increase in the buffers, but it doesn't matter because it is intended for local config only. I would never let a user run a server on an open port anyway :) I am very sure that its RTOS task isn't actually hanging; it is hanging on the LWIP API calls. That in turn suggests insufficient buffers somewhere.

FWIW my lwipopts.h file is below, in case anybody is digging around here in years to come

Code: [Select]
/**
  ******************************************************************************
  * @file    LwIP/LwIP_HTTP_Server_Netconn_RTOS/Inc/lwipopts.h
  * @author  MCD Application Team
  * @brief   lwIP Options Configuration.
  ******************************************************************************
*
* This sort of explains the memory usage
* [url]https://lwip-users.nongnu.narkive.com/dkzkPa8l/lwip-memory-settings[/url]
* [url]https://www.cnblogs.com/shangdawei/p/3494148.html[/url]
* [url]https://lwip.fandom.com/wiki/Tuning_TCP[/url]
* [url]https://groups.google.com/g/osdeve_mirror_tcpip_lwip/c/lFYJ7Fw0Cxg[/url]
* ST UM1713 document gives an overview of integrating all this.
*
*
*
*
* 7/7/22 PH MEM_SIZE set to 5k (was 10k). Only ~1.5k is used.
* 13/7/22 PH MEMP_MEM_MALLOC=1, MEM_SIZE=16k.
* 14/7/22 PH MEMP_MEM_MALLOC=0; 1 was unreliable. 8 x 512 byte buffers now.
* 6/8/22 PH 4 x MTU buffers for RX. Done partly for EditGetData().
* MEM_SIZE=6k. 5k is used - see .map file for ram_heap address.
* Sizes of various static RAM structures determined experimentally.
*  17/12/22 PH LWIP_TCP_KEEPALIVE=1 (for ethser)
*  27/1/23 PH IP_SOF_BROADCAST etc added, later commented-out.
*  10/2/23 PH PBUF_POOL_BUFSIZE 4-aligned.
*  12/2/23 PH PBUF_POOL_SIZE = 6.
*
*
*
*
*
*
*
*
  */
#ifndef __LWIPOPTS_H__
#define __LWIPOPTS_H__

/**
 * NO_SYS==1: Provides VERY minimal functionality. Otherwise,
 * use lwIP facilities.
 */
#define NO_SYS                  0

// Flag to make LWIP API thread-safe. The netconn and socket APIs are claimed
// to be thread-safe anyway. The raw API is never thread-safe.
// A huge amount of online discussion on this topic; most of it unclear, but
// ON (1) seems to be recommended, as being more efficient.

#define LWIP_TCPIP_CORE_LOCKING    1

// If LWIP_TCPIP_CORE_LOCKING=0 then these two need to be 1
// See [url]https://www.nongnu.org/lwip/2_1_x/multithreading.html[/url]
//#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1
//#define SYS_LIGHTWEIGHT_PROT            1

// This places more objects into the static block defined by MEM_SIZE.
// Uses mem_malloc/mem_free instead of the lwip pool allocator.
// MEM_SIZE now needs to be increased by about 10k.
// It doesn't magically produce extra memory, and causes crashes.
// There is also a performance loss, apparently. AVOID.
#define MEMP_MEM_MALLOC 0


//NC: Need for sending PING messages by keepalive
#define LWIP_RAW 1
#define DEFAULT_RAW_RECVMBOX_SIZE 4

// For ETHSER
#define LWIP_TCP_KEEPALIVE 1

/*-----------------------------------------------------------------------------*/

/* LwIP Stack Parameters (modified compared to initialization value in opt.h) -*/
/* Parameters set in STM32CubeMX LwIP Configuration GUI -*/

/*----- Value in opt.h for LWIP_DNS: 0 -----*/
#define LWIP_DNS 1

/* ---------- Memory options ---------- */
/* MEM_ALIGNMENT: should be set to the alignment of the CPU for which
   lwIP is compiled. 4 byte alignment -> define MEM_ALIGNMENT to 4, 2
   byte alignment -> define MEM_ALIGNMENT to 2. */
#define MEM_ALIGNMENT           4

/*
* MEM_SIZE: the size of the heap memory. This is a statically allocated block. You can find it
* in the .map file as the symbol ram_heap and you can see how much of this RAM gets used.
* If MEMP_MEM_MALLOC=0, this holds just the PBUF_ stuff.
* If MEMP_MEM_MALLOC=1 (which is not reliable) this greatly expands and needs 16k+.
* Empirically this needs to be big enough for at least 4 x PBUF_POOL_BUFSIZE.
* This value also limits the biggest block size sent out by netconn_write. With a MEM_SIZE
* of 6k, the biggest block netconn_write (and probably socket write) will send out is 4k.
* This setting is mostly related to outgoing data.
*/

#define MEM_SIZE                (6*1024)

// MEMP_ structures. Their sizes have been determined experimentally, by
// increasing them and seeing free RAM changing.

/* MEMP_NUM_PBUF: the number of memp struct pbufs. If the application
   sends a lot of data out of ROM (or other static memory), this
   should be set high. */
//NC: Increased to 20 for ethser
#define MEMP_NUM_PBUF           20 // each 1 is 20 bytes of static RAM

/* MEMP_NUM_UDP_PCB: the number of UDP protocol control blocks. One
   per active UDP "connection". */
#define MEMP_NUM_UDP_PCB        6 // each 1 is 32 bytes of static RAM

/* MEMP_NUM_TCP_PCB: the number of simultaneously active TCP
   connections. */
//NC: Increased to 20 for ethser
#define MEMP_NUM_TCP_PCB        20 // each 1 is 145 bytes of static RAM

//NC: Have more sockets available. Is set to 4 in opt.h
#define MEMP_NUM_NETCONN    10

/* MEMP_NUM_TCP_PCB_LISTEN: the number of listening TCP
   connections. */
//NC: Increased to 20 for ethser
#define MEMP_NUM_TCP_PCB_LISTEN 20 // each 1 is 28 bytes of static RAM

/* MEMP_NUM_TCP_SEG: the number of simultaneously queued TCP
   segments. */
// Was 8; increased to 16 as it improves ETHSER reliability when running
// HTTP server
#define MEMP_NUM_TCP_SEG        16 // each 1 is 20 bytes of static RAM

/* MEMP_NUM_SYS_TIMEOUT: the number of simultaneously active
   timeouts. */
#define MEMP_NUM_SYS_TIMEOUT    10 // each 1 is 16 bytes of static RAM


/* ---------- Pbuf dynamically allocated buffer blocks  ----------
*
* These settings make little difference to performance, which is dominated by
* the low_level_input poll period. These PBUFs relate directly to the netconn API netbufs.
*
* PBUF_POOL_SIZE: the number of buffers in the pbuf pool.
* Statically allocated, so each increment frees up 1.5k of RAM. Any value under 4 slows
* down the data rate a lot.
* 12/2/23 PH 4 -> 6 fixes TLS blocking of various RTOS tasks (5 almost does).
* The most this can go to with TLS still having enough free RAM for its 48k block is ~9.
*/

#define PBUF_POOL_SIZE           6

/* PBUF_POOL_BUFSIZE: the size of each pbuf in the pbuf pool.
* The +2 is to get a multiple of 4 bytes so the PBUFs are 4-aligned (for memcpy_fast())
* Probably the +2 is not needed because the PBUFs are 4-aligned anyway in the LWIP code.
*/

#define PBUF_POOL_BUFSIZE  (1500 + PBUF_LINK_ENCAPSULATION_HLEN + PBUF_LINK_HLEN + 2)


/* ---------- TCP options ---------- */
#define LWIP_TCP                1
#define TCP_TTL                 255

/* Controls if TCP should queue segments that arrive out of
   order. Define to 0 if your device is low on memory. */
#define TCP_QUEUE_OOSEQ         0

/* TCP Maximum segment size. */
#define TCP_MSS                 (1500 - 40)   /* TCP_MSS = (Ethernet MTU - IP header size - TCP header size) */

/* TCP sender buffer space (bytes). */
// Reduced from 4*MSS to leave more room for TX packets in the LWIP heap (MEM_SIZE).
#define TCP_SND_BUF             (2*TCP_MSS) // no effect on static RAM

/*  TCP_SND_QUEUELEN: TCP sender buffer space (pbufs). This must be at least
  as much as (2 * TCP_SND_BUF/TCP_MSS) for things to work. */
// Was 2*; increased to 4* as it improves ETHSER reliability when running
// HTTP server
#define TCP_SND_QUEUELEN        (4* TCP_SND_BUF/TCP_MSS) // (2* TCP_SND_BUF/TCP_MSS)

/* TCP advertised receive window. */
// Should be less than PBUF_POOL_SIZE * (PBUF_POOL_BUFSIZE - protocol headers)
#define TCP_WND                 (2*TCP_MSS) // no effect on static RAM


/* ---------- ICMP options ---------- */
#define LWIP_ICMP            1


/* ---------- DHCP options ---------- */
#define LWIP_DHCP               1


/* ---------- UDP options ---------- */
#define LWIP_UDP                1
#define UDP_TTL                 255

// These are build flags which disable the support for the SOF_BROADCAST option on raw and UDP PCBs
// Commented-out because changing these requires a recompilation, and an application which receives
// broadcast packets may one day be necessary (set g_eth_multi=true to disable the packet filter in
// ethernetif.c)
//#define IP_SOF_BROADCAST                1
//#define IP_SOF_BROADCAST_RECV           1


/* ---------- Statistics options ---------- */
#define LWIP_STATS 0

/* ---------- link callback options ---------- */
/* LWIP_NETIF_LINK_CALLBACK==1: Support a callback function from an interface
 * whenever the link changes (i.e., link down)
 * 8/2022 this is done from the low_level_input RTOS task.
 */
#define LWIP_NETIF_LINK_CALLBACK        0




/*
   --------------------------------------
   ---------- Checksum options ----------
   --------------------------------------
*/

/*
The STM32F4xx allows computing and verifying the IP, UDP, TCP and ICMP checksums by hardware:
 - To use this feature, leave the following define uncommented.
 - To disable it and compute the checksums in software, comment out the define below.
*/
#define CHECKSUM_BY_HARDWARE


#ifdef CHECKSUM_BY_HARDWARE
  /* CHECKSUM_GEN_IP==0: Generate checksums by hardware for outgoing IP packets.*/
  #define CHECKSUM_GEN_IP                 0
  /* CHECKSUM_GEN_UDP==0: Generate checksums by hardware for outgoing UDP packets.*/
  #define CHECKSUM_GEN_UDP                0
  /* CHECKSUM_GEN_TCP==0: Generate checksums by hardware for outgoing TCP packets.*/
  #define CHECKSUM_GEN_TCP                0
  /* CHECKSUM_CHECK_IP==0: Check checksums by hardware for incoming IP packets.*/
  #define CHECKSUM_CHECK_IP               0
  /* CHECKSUM_CHECK_UDP==0: Check checksums by hardware for incoming UDP packets.*/
  #define CHECKSUM_CHECK_UDP              0
  /* CHECKSUM_CHECK_TCP==0: Check checksums by hardware for incoming TCP packets.*/
  #define CHECKSUM_CHECK_TCP              0
  /* CHECKSUM_GEN_ICMP==0: Generate checksums by hardware for outgoing ICMP packets.*/
  #define CHECKSUM_GEN_ICMP               0
#else
  /* CHECKSUM_GEN_IP==1: Generate checksums in software for outgoing IP packets.*/
  #define CHECKSUM_GEN_IP                 1
  /* CHECKSUM_GEN_UDP==1: Generate checksums in software for outgoing UDP packets.*/
  #define CHECKSUM_GEN_UDP                1
  /* CHECKSUM_GEN_TCP==1: Generate checksums in software for outgoing TCP packets.*/
  #define CHECKSUM_GEN_TCP                1
  /* CHECKSUM_CHECK_IP==1: Check checksums in software for incoming IP packets.*/
  #define CHECKSUM_CHECK_IP               1
  /* CHECKSUM_CHECK_UDP==1: Check checksums in software for incoming UDP packets.*/
  #define CHECKSUM_CHECK_UDP              1
  /* CHECKSUM_CHECK_TCP==1: Check checksums in software for incoming TCP packets.*/
  #define CHECKSUM_CHECK_TCP              1
  /* CHECKSUM_GEN_ICMP==1: Generate checksums in software for outgoing ICMP packets.*/
  #define CHECKSUM_GEN_ICMP               1
#endif


/*
   ----------------------------------------------
   ---------- Sequential layer options ----------
   ----------------------------------------------
*/
/**
 * LWIP_NETCONN==1: Enable Netconn API (require to use api_lib.c)
 */
#define LWIP_NETCONN                    1

/*
   ------------------------------------
   ---------- Socket options ----------
   ------------------------------------
*/
/**
 * LWIP_SOCKET==1: Enable Socket API (require to use sockets.c)
 */
#define LWIP_SOCKET                     1

/*
   ------------------------------------
   ---------- httpd options ----------
   ------------------------------------
*/
/** Set this to 1 to include "fsdata_custom.c" instead of "fsdata.c" for the
 * file system (to prevent changing the file included in CVS) */
#define HTTPD_USE_CUSTOM_FSDATA   0

/*
   ---------------------------------
   ---------- OS options ----------
   ---------------------------------
*/

#define TCPIP_THREAD_NAME              "TCP/IP"
#define TCPIP_THREAD_STACKSIZE          4096
#define TCPIP_MBOX_SIZE                 6
#define DEFAULT_UDP_RECVMBOX_SIZE       6
#define DEFAULT_TCP_RECVMBOX_SIZE       6
#define DEFAULT_ACCEPTMBOX_SIZE         6
#define DEFAULT_THREAD_STACKSIZE        512
#define TCPIP_THREAD_PRIO               osPriorityHigh // should be >= that of any TCP/IP apps

#define LWIP_DEBUG 1

/*
#define IP_DEBUG LWIP_DBG_ON
#define DHCP_DEBUG LWIP_DBG_OFF
#define UDP_DEBUG LWIP_DBG_ON
#define SOCKETS_DEBUG LWIP_DBG_ON
//#define ICMP_DEBUG LWIP_DBG_ON|LWIP_DBG_TRACE
//#define NETIF_DEBUG LWIP_DBG_OFF
#define LWIP_DBG_TYPES_ON  (LWIP_DBG_TRACE|LWIP_DBG_STATE)
*/

#define LWIP_SO_RCVTIMEO 1
#define LWIP_NETIF_HOSTNAME 1
#define SO_REUSE 1

// Defining these produces various errors
//#define LWIP_IPV6 1
//#define LWIP_IPV6_DHCP6 1

/*
// TODO
#ifdef LWIP_DEBUG

#define MEMP_OVERFLOW_CHECK            ( 1 )
#define MEMP_SANITY_CHECK              ( 1 )

#define MEM_DEBUG        LWIP_DBG_OFF
#define MEMP_DEBUG       LWIP_DBG_OFF
#define PBUF_DEBUG       LWIP_DBG_ON
#define API_LIB_DEBUG    LWIP_DBG_ON
#define API_MSG_DEBUG    LWIP_DBG_ON
#define TCPIP_DEBUG      LWIP_DBG_ON
#define NETIF_DEBUG      LWIP_DBG_ON
#define SOCKETS_DEBUG    LWIP_DBG_ON
#define DEMO_DEBUG       LWIP_DBG_ON
#define IP_DEBUG         LWIP_DBG_ON
#define IP6_DEBUG        LWIP_DBG_ON
#define IP_REASS_DEBUG   LWIP_DBG_ON
#define RAW_DEBUG        LWIP_DBG_ON
#define ICMP_DEBUG       LWIP_DBG_ON
#define UDP_DEBUG        LWIP_DBG_ON
#define TCP_DEBUG        LWIP_DBG_ON
#define TCP_INPUT_DEBUG  LWIP_DBG_ON
#define TCP_OUTPUT_DEBUG LWIP_DBG_ON
#define TCP_RTO_DEBUG    LWIP_DBG_ON
#define TCP_CWND_DEBUG   LWIP_DBG_ON
#define TCP_WND_DEBUG    LWIP_DBG_ON
#define TCP_FR_DEBUG     LWIP_DBG_ON
#define TCP_QLEN_DEBUG   LWIP_DBG_ON
#define TCP_RST_DEBUG    LWIP_DBG_ON
#define PPP_DEBUG        LWIP_DBG_OFF

#define LWIP_DBG_TYPES_ON         (LWIP_DBG_ON|LWIP_DBG_TRACE|LWIP_DBG_STATE|LWIP_DBG_FRESH|LWIP_DBG_HALT)

#endif
*/

#endif /* __LWIPOPTS_H__ */



« Last Edit: February 15, 2023, 03:13:41 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #94 on: February 15, 2023, 06:35:48 pm »
Quote
What happens if you try to send a bigger one I didn't investigate because I've been using

Well, at the level of the LwIP API you're not sending a big "packet" per se, but rather just a big chunk of data.  Regardless of how big a chunk you write, the actual size of the outgoing packets will be limited by the TCP_MSS value.  And as I described earlier, if you exhaust the MEM_SIZE heap while writing a big chunk your task will block in the write call.
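
For illustration, a rough sketch of what I mean (not code from this project; send_chunked and the chunk size are made up): capping the per-call write size bounds how much of the MEM_SIZE heap any one netconn_write() can tie up before it blocks waiting for ACKs.

Code: [Select]
#include "lwip/api.h"

/* Send a large buffer in smaller pieces. lwIP still splits each piece
   into TCP_MSS-sized segments internally; the point of chunking is only
   to limit how much heap a single write call can consume at once. */
static err_t send_chunked(struct netconn *conn, const u8_t *buf, size_t len)
{
    const size_t chunk = 2 * TCP_MSS;   /* keep each write well inside MEM_SIZE */
    err_t err = ERR_OK;

    while (len > 0 && err == ERR_OK)
    {
        size_t n = (len > chunk) ? chunk : len;
        err = netconn_write(conn, buf, n, NETCONN_COPY);
        buf += n;
        len -= n;
    }
    return err;
}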

Quote
It was already set to 2xTCP_MSS following your earlier suggestion.

Ah, indeed.  Sorry, I lost track of our earlier conversation.

Quote
We have not tested DES/3DES which could be moved to 417 hardware, but I've not been able to find out if anybody out there might still be using it, and there is a proposal for new TLS to drop DES.

DES/3DES is dead, dead, dead.  No need to even think about it.

I'm assuming the other end of the TLS connection is going to be a relatively standard server/desktop class TLS implementation.  In that case, I'd build the TLS code such that it only supports TLS 1.3, which only allows AES and ChaCha20-Poly1305 anyway.  Indeed, I'd go so far as to limit the cipher suites to just TLS_AES_128_GCM_SHA256, or TLS_AES_128_CCM_SHA256 if you can get away with it.  No need for anything else.
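
As a concrete illustration, something along these lines (a sketch only, assuming Mbed TLS 3.2 or later built with MBEDTLS_SSL_PROTO_TLS1_3 enabled; the function name is made up and error handling is minimal):

Code: [Select]
#include "mbedtls/ssl.h"

/* Zero-terminated list: allow exactly one TLS 1.3 cipher suite. */
static const int allowed_suites[] = {
    MBEDTLS_TLS1_3_AES_128_GCM_SHA256,
    0
};

/* Restrict an already-initialised mbedtls_ssl_config to TLS 1.3 only. */
static int configure_tls13_only(mbedtls_ssl_config *conf)
{
    int ret = mbedtls_ssl_config_defaults(conf,
                                          MBEDTLS_SSL_IS_CLIENT,
                                          MBEDTLS_SSL_TRANSPORT_STREAM,
                                          MBEDTLS_SSL_PRESET_DEFAULT);
    if (ret != 0)
        return ret;

    mbedtls_ssl_conf_min_tls_version(conf, MBEDTLS_SSL_VERSION_TLS1_3);
    mbedtls_ssl_conf_max_tls_version(conf, MBEDTLS_SSL_VERSION_TLS1_3);
    mbedtls_ssl_conf_ciphersuites(conf, allowed_suites);
    return 0;
}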

Quote
The remaining task that still gets hung up by TLS is an HTTP server which uses the netconn API, and nothing I do seems to fix it, not even a priority of 40 (same as LWIP) or a huge increase in the buffers. It doesn't matter much because it is intended for local config only; I would never let a user run a server on an open port anyway :) I am very sure its RTOS task isn't actually hanging; it is blocking in the LWIP API calls, which in turn suggests insufficient buffers somewhere.

Maybe this is easier said than done, but it seems like it's time to get a debugger on it and see what the threads are doing when it's hung.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #95 on: February 16, 2023, 04:23:08 am »
Quote
DES/3DES is dead, dead, dead.  No need to even think about it.

Can one be 100% sure? There are loads of servers out there, hosting highly public stuff like phpBB communities, which have not been patched for 20 years. One I know got hit recently ;) The box I am doing is for industrial applications, and that area is notorious for "if it works, leave it". Also, after a few years the person who set it up has usually left, and most coders (especially server-side - something I am quite involved with spec-wise and management-wise, although I don't do the actual config and coding) never document anything. It is only when something breaks because some certificate has expired that... there is panic. And money...

Quote
Maybe this is easier said than done, but it seems like its time to get a debugger on it and see what the threads are doing when its hung.

You will have the last laugh. I did some GPIO-waggle debugging and found the task was getting hung up by TLS's filesystem reads of the certificate file; it got hung because the HTTP server also accesses the filesystem to display a directory listing :)

FatFs obviously needs mutex protection for writing (FAT12 in this case), although it isn't obvious that a mutex is needed for reading (I have one though). Re-entrancy is a build option in FatFs, but if you enable it you cannot use FatFs before the RTOS starts, so e.g. you cannot read a config file which selects which RTOS tasks to run.
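
To show what I mean by doing the locking at application level instead (just a sketch; fs_lock_init and fs_read_file are names I made up, the FatFs and FreeRTOS calls are the standard ones, and code running before the scheduler starts can still call f_open/f_read directly because FatFs itself is not locked):

Code: [Select]
#include "ff.h"
#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t fs_mutex;   /* one global lock for all FatFs access */

void fs_lock_init(void)
{
    fs_mutex = xSemaphoreCreateMutex();
}

/* Read a whole file under the lock, so e.g. the HTTP server's directory
   listing cannot interleave with TLS reading the certificate file. */
FRESULT fs_read_file(const char *path, void *buf, UINT bufsize, UINT *bytes_read)
{
    FIL f;
    FRESULT res;

    xSemaphoreTake(fs_mutex, portMAX_DELAY);
    res = f_open(&f, path, FA_READ);
    if (res == FR_OK)
    {
        res = f_read(&f, buf, bufsize, bytes_read);
        f_close(&f);
    }
    xSemaphoreGive(fs_mutex);
    return res;
}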

Another interesting point, for which I can't find a clear explanation, is related to FreeRTOS

#define configUSE_PREEMPTION    1
#define configUSE_TIME_SLICING  1

and whether the following is correct:

The RTOS is not pre-emptive on a time slice basis at the same priority, so a process which is just a loop will get nearly all CPU time unless there are higher priority tasks running.

The question is whether "same" is correct. I think that if you have a load of tasks at priority 10, all running a
while (true) {}
loop, the RTOS will switch between them at the tick rate (1 kHz here) and each one will get roughly 1 ms in turn. From what I can find, this behaviour was changed in FreeRTOS some years ago.

Obviously spinning like that is dumb; if you have nothing to do then
while (true)
{
  osDelay(1);
}
is much better, because the RTOS can switch away immediately and can also run lower-priority tasks, whereas
while (true)
{
  taskYIELD();
}
lets it run only tasks of the same or higher priority.
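
A trivial way to see the time-slicing behaviour (sketch only; the task names, stack sizes and the priority of 10 are arbitrary and assume configMAX_PRIORITIES allows it):

Code: [Select]
#include <stdint.h>
#include "FreeRTOS.h"
#include "task.h"

static volatile uint32_t count_a, count_b;

/* Never blocks; relies on the tick interrupt to be switched out. */
static void busy_task(void *arg)
{
    volatile uint32_t *counter = arg;
    for (;;)
    {
        (*counter)++;
    }
}

void start_demo_tasks(void)
{
    /* With configUSE_PREEMPTION=1 and configUSE_TIME_SLICING=1 both
       counters advance at a similar rate, because the scheduler rotates
       equal-priority ready tasks on every tick (1 ms at a 1 kHz tick).
       With time slicing off, whichever task runs first keeps the CPU
       until it blocks or yields. */
    xTaskCreate(busy_task, "busyA", 256, (void *)&count_a, 10, NULL);
    xTaskCreate(busy_task, "busyB", 256, (void *)&count_b, 10, NULL);
}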
« Last Edit: February 16, 2023, 03:44:48 pm by peter-h »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #96 on: March 07, 2023, 04:38:28 pm »
Not quite LWIP, but MbedTLS usually runs on top of LWIP.

This turned up on the MbedTLS mailing list, and I am posting it here so somebody can have a laugh at the irony of "security" in the embedded sphere :)

To: mbed-tls@lists.trustedfirmware.org
Subject: [mbed-tls] security issue in mbedtls 3.30
From: Avi Epstein via mbed-tls <mbed-tls@lists.trustedfirmware.org>
Date: Tue, 07 Mar 2023 16:02:22 -0000

security issue in mbedtls 3.30 in the release notes:

"An adversary with access to precise enough information about memory
accesses (typically, an untrusted operating system attacking a secure
enclave) could recover an RSA private key after observing the victim
performing a single private-key operation if the window size used for the
exponentiation was 3 or smaller. Found and reported by Zili KOU,
Wenjian HE, Sharad Sinha, and Wei ZHANG. See "Cache Side-channel Attacks
and Defenses of the Sliding Window Algorithm in TEEs" - Design, Automation
and Test in Europe 2023."

was this issue solved in this version?
--
mbed-tls mailing list -- mbed-tls@lists.trustedfirmware.org
To unsubscribe send an email to mbed-tls-leave@lists.trustedfirmware.org
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline dare

  • Contributor
  • Posts: 39
  • Country: us
Re: Anyone here familiar with LWIP?
« Reply #97 on: March 07, 2023, 06:59:17 pm »
See "Cache Side-channel Attacks and Defenses of the Sliding Window Algorithm in TEEs"

It appears that this paper has yet to be published.  However, based on the abstract, the described attack mode has nothing to do with the class of processors being discussed here.  In particular, this is an attack against systems with Trusted Execution Environments (TEEs), e.g. Intel SGX or ARM TrustZone.  And as with most such attacks, this one appears to be limited to situations where adversarial code is running on the target system itself, which is typically only seen in environments such as mobile phones and virtual machine services.  Monolithic embedded applications such as those that run on small STM32 parts are not subject to this mode of attack.

That said, side channel attacks (where secret data are leaked via seemingly innocuous means, such as EMF, power use fluctuations, or even sound) are very real and devilishly hard to avoid.  Indeed, the laptop you're reading this on is probably leaking all sorts of things right now: https://www.cse.wustl.edu/~roger/566S.s21/09065580.pdf.  Because of this, I believe that evidence of a side channel attack involving Mbed TLS is not really an indictment of IoT security as a whole as much as it is an indication of how hard security is in general.

You have to keep this in perspective, though.  It takes a lot of work to mount a practical side-channel attack against a monolithic embedded application.  And in most cases this requires close proximity to the victim device.  So side channel attacks generally are less urgent to defend against than, say, remote code execution attacks resulting from simple coding bugs.
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #98 on: March 07, 2023, 08:52:18 pm »
That's an interesting paper (on the laptop stuff), but what happened to Van Eck? The old CRTs were great for that because the scan coils and grid drives generated tons of RF (I used to design CRT drive boards in the 1980s). LCDs obviously work differently, but I am 100% sure you can recover a laptop screen image from a great distance.

I think most side channel attacks rely on knowing the structure of the software running on the machine. If open source, this is trivial, so if you have say Apache running on one VM on some server, you know where to look.

Embedded boxes are rarely open source, so you have an uphill job right at the start. Key leakage is still possible (as you note, by supply current measurement, Vcc manipulation, etc; smartcard chips are designed to resist that stuff, and you are free to put one in your IOT box if needed), but with no knowledge of the software, where do you start? Breaking the security fuse protection to start with, then disassembly...

But what makes me laugh is that there is no security whatsoever without physical security, and 99.9% of the time in the IOT sphere you have no physical security. Mostly, you can access the actual box, get into some boot loader or whatever, replace the TLS certificate store, and if you install a fake DNS server on the LAN then you are in.

There is stuff like secure Modbus, which is completely pointless since all interconnected boxes are usually on one LAN, with no physical security...

I would bet that for remote code execution to be viable it would need to be applied against known open source software which you know is used inside the box, such as LWIP or MbedTLS. But it would still be hard work, probably needing to be preceded by breaking the CPU security and disassembling a load of code.

What does one do in applications where you want secure comms for very good, well-known reasons but you don't want a whole PC-like server in there? For example an electricity substation which you know is going to get remotely attacked by Russia? This is a decades-old problem which pre-dates modern over-internet hacking (look up the GI74 protocol). "IOT" is no good for that because the boxes rarely if ever get security patches. It is a good MbedTLS application - in theory, but absolutely not in practice.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

Offline peter-hTopic starter

  • Super Contributor
  • ***
  • Posts: 3694
  • Country: gb
  • Doing electronics since the 1960s...
Re: Anyone here familiar with LWIP?
« Reply #99 on: May 02, 2023, 02:38:30 pm »
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 

