Topic: Anyone here familiar with LWIP?

dare · « **Reply #50 on:** January 28, 2023, 12:49:08 pm »

My apologies. I forgot you are initiating connections from your system. Try the following version of the function. It also includes logging, so you can see the decision being made along with the input data.

Code: [Select]

static bool should_accept_ethernet_packet(const uint8_t * pktData, size_t pktLen)
{
  const struct eth_hdr * pktHdr = (const struct eth_hdr *)pktData;
  bool accept;

  /* If the packet is too short, reject it immediately, without trying to log it. */
  if (pktLen < sizeof(struct eth_hdr)) {
    return false;
  }

  /* if the packet is to an IPv4 multicast destination, reject it. */
  else if (pktHdr->dest.addr[0] == LL_IP4_MULTICAST_ADDR_0 &&
           pktHdr->dest.addr[1] == LL_IP4_MULTICAST_ADDR_1 &&
           pktHdr->dest.addr[2] == LL_IP4_MULTICAST_ADDR_2) {
    accept = false;
  }

  /* If the packet is a broadcast... */
  else if (eth_addr_cmp(&pktHdr->dest, &ethbroadcast)) {

    /* If the packet is an ARP packet, accept it. */
    if (ntohs(pktHdr->type) == ETHTYPE_ARP) {
      accept = true;
    }

    /* Reject all other broadcast packets. */
    else {
      accept = false;
    }
  }

  /* if the packet is a unicast ARP or IPv4 packet, accept it. */
  else if (ntohs(pktHdr->type) == ETHTYPE_ARP ||
           ntohs(pktHdr->type) == ETHTYPE_IP) {
    accept = true;
  }

  /* Reject all other unicast packets. */
  else {
    accept = false;
  }

  LWIP_DEBUGF(LWIP_DBG_ON | LWIP_DBG_TRACE,
              ("should_accept_ethernet_packet: %s, dest:%"X8_F":%"X8_F":%"X8_F":%"X8_F":%"X8_F":%"X8_F", src:%"X8_F":%"X8_F":%"X8_F":%"X8_F":%"X8_F":%"X8_F", type:%04"X16_F"\n",
               (accept) ? "TRUE" : "FALSE",
               (unsigned char)pktHdr->dest.addr[0], (unsigned char)pktHdr->dest.addr[1], (unsigned char)pktHdr->dest.addr[2],
               (unsigned char)pktHdr->dest.addr[3], (unsigned char)pktHdr->dest.addr[4], (unsigned char)pktHdr->dest.addr[5],
               (unsigned char)pktHdr->src.addr[0],  (unsigned char)pktHdr->src.addr[1],  (unsigned char)pktHdr->src.addr[2],
               (unsigned char)pktHdr->src.addr[3],  (unsigned char)pktHdr->src.addr[4],  (unsigned char)pktHdr->src.addr[5],
               lwip_htons(pktHdr->type)));

  return accept;
}

The above code has been tested using the LwIP sample application running inside a linux process. I can confirm that it properly accepts DHCP, ARP, UDPv4 and TCPv4 traffic, including both initiating and accepting TCP connections.

When looking at the log output, type:0800 indicates an IPv4 packet, type:0806 is an ARP packet and type:86DD is an IPv6 packet (which should be dropped). All FFs in the dest address indicate a broadcast packet.
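For reading those log lines, or for checking captured frames on a host, the same fields can be decoded with a few lines of C. This is only a sketch: the function and macro names are made up for illustration, and it assumes untagged frames (no VLAN tag) and no leading padding in front of the Ethernet header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* EtherType values as they appear in the "type:" field of the log. */
#define ETYPE_IPV4 0x0800u
#define ETYPE_ARP  0x0806u
#define ETYPE_IPV6 0x86DDu

/* Read the EtherType from a raw frame. Returns 0 if the frame is shorter
 * than the 14-byte Ethernet header (6 dest + 6 src + 2 type). */
uint16_t frame_ethertype(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < 14) {
        return 0;
    }
    /* The type field is big-endian on the wire. */
    return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* True when all six destination bytes are 0xFF. */
int frame_is_broadcast(const uint8_t *frame, size_t len)
{
    static const uint8_t bcast[6] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
    assert(frame != NULL);
    return len >= 6 && memcmp(frame, bcast, 6) == 0;
}

/* Human-readable name for the three types mentioned above. */
const char *ethertype_name(uint16_t type)
{
    switch (type) {
    case ETYPE_IPV4: return "IPv4";
    case ETYPE_ARP:  return "ARP";
    case ETYPE_IPV6: return "IPv6";
    default:         return "other";
    }
}
```

A broadcast ARP request would show up as dest FF:FF:FF:FF:FF:FF with type:0806, which is the one broadcast case the filter lets through.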

Quote from: peter-h on January 28, 2023, 09:33:31 am

curiously dropping non-ARP broadcast packets in low_level_input (as discussed above) does not avoid the need for more RX buffers at the PHY level. I still need 3+ there. So that issue is more subtle - assuming that my multicast dropping above is actually working.

As I mentioned earlier, this code protects the second queuing point for incoming packets; specifically, the input queue to the LwIP tcpip thread. It does nothing to prevent the Ethernet controller from running out of receive buffers (which constitutes the first queuing point). To deal with that, you need to do one or more of the following: increase the polling frequency of your ethernetif_input task, switch to an interrupt-driven wake for the ethernetif_input task, and/or increase the number of receive buffers.

Quote

Another interesting data point is that low_level_input executes in 2us if there is no rx data, and 24-28us if there is a packet to be stuffed into LWIP. This puts a bit of perspective on whether an ISR is useful or not. 168MHz 32F417.

Given how short this is, I suspect the polling frequency is the dominant factor in whether the Ethernet controller runs out of buffers. If I understand correctly, the interval you've set is 10ms idle/2ms with traffic, which is fairly long.

peter-h · « **Reply #51 on:** January 28, 2023, 02:14:22 pm »

That is great - all working.

Quote

I can confirm that it properly accepts DHCP, ARP, UDPv4 and TCPv4 traffic, including both initiating and accepting TCP connections.

I am also doing NTP (client and server) and potentially other stuff. Do I need to put those in also? I thought most of that is done with UDP packets addressed to a specific IP, which was obtained previously somehow, possibly by DNS?

Quote

As I mentioned earlier, this code protects the second queuing point for incoming packets; specifically, the input queue to the LwIP tcpip thread. It does nothing to prevent the Ethernet controller from running out of receive buffers (which constitutes the first queuing point). To deal with that, you need to do one or more of the following: increase the polling frequency of your ethernetif_input task, switch to an interrupt-driven wake for the ethernetif_input task, and/or increase the number of receive buffers.

I wonder why this is not a much more widespread issue.

Those broadcast packets would trigger a pause which was never less than 1-3 seconds. This didn't look right as being the "first backoff timeout for a missing packet" for TCP, because that would be ridiculous and would really hit performance. I would expect the first backoff to be maybe 100-300ms and this not noticeable in most applications. And the length of the pause meant it was obviously generated within LWIP.

Quote

Given how short this is, I suspect the polling frequency is the dominant factor in whether the Ethernet controller runs out of buffers. If I understand correctly, the interval you've set is 10ms idle/2ms with traffic, which is fairly long.

Yes. The problem with polling via the osDelay() function (which is desirable because it releases control to run other RTOS tasks) is that a parameter of 1 gives you a delay of anywhere from 0 to 1ms, so I use osDelay(2) if I actually want some sort of delay (1-2ms). But osDelay(1) will still release control. So I made the "fast poll" that value, and it all still seems to run fine. But it didn't help at all with the number of ETH buffers at which LWIP starts to fall over (2 buffers), so I went back to (2) because it definitely provides more time for other RTOS tasks.

Re interrupt driven RX, I don't want to put in that time right now because I am trying to get onto other stuff (this product is nearly done) but what would help would be a very simple ISR which just sets a "there is ETH RX data" flag. In my polling loop I could then execute that osDelay(2) only if that flag is not set. That would be a reasonable and safe compromise. Not sure how to set this up though. I have done timer ISRs before but never played with the ETH controller. The "HAL" code sets up an ISR which then tests every possible source, which is ridiculous. And the interrupt source needs to be cleared correctly so the ISR doing nothing may not work unless it just clears the IP bit or some such.

OTOH, remaining in a delay-free polling loop until the ETH buffers are empty will do the same thing.
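The "ISR just sets a flag" compromise described above could look roughly like the following. It is a minimal sketch in portable C11, not ST or FreeRTOS code: all three function names are hypothetical, and clearing the actual ETH DMA interrupt bit in the ISR is device-specific and deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Set from the ETH receive interrupt, cleared by the polling task. */
static atomic_bool eth_rx_pending = false;

/* Call from the ETH ISR once the hardware interrupt source has been
 * cleared (that part is hardware-specific and not shown here). */
void eth_rx_mark_pending(void)
{
    atomic_store(&eth_rx_pending, true);
}

/* Atomically read and clear the flag. Returns true if at least one
 * receive interrupt fired since the previous call. */
bool eth_rx_take_pending(void)
{
    return atomic_exchange(&eth_rx_pending, false);
}

/* Decide how long the poll loop should sleep, in RTOS ticks.
 * 0 means "do not sleep, poll again straight away". */
unsigned eth_rx_poll_delay(unsigned idle_delay_ticks)
{
    assert(idle_delay_ticks > 0);
    return eth_rx_take_pending() ? 0u : idle_delay_ticks;
}
```

In the polling loop the idea would be to call osDelay() only when eth_rx_poll_delay(2) returns non-zero. Note that skipping the delay whenever the flag is set means the task does not yield while traffic keeps arriving, so some cap on consecutive no-delay iterations is probably still wanted.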

ejeffrey · « **Reply #52 on:** January 28, 2023, 03:54:15 pm »

You might force the line rate down to 10 megabit. The network switch will have considerably more buffer capacity than your code so pushing some of the buffering back there might help.

I'm not sure if FreeRTOS has this, but the way to use an ISR for this is to use something like a condition variable or event wait instead of sleep. Your network task would wait on the event or a timeout and the ISR would signal the event, causing the wait to unblock and the task to be scheduled according to its priority.

dare · « **Reply #53 on:** January 28, 2023, 04:53:41 pm »

Quote from: peter-h on January 28, 2023, 02:14:22 pm

I am also doing NTP (client and server) and potentially other stuff. Do I need to put those in also? I thought most of that is done with UDP packets addressed to a specific IP, which was obtained previously somehow, possibly by DNS?

Both NTP and DNS are UDP, and typically unicast. So it shouldn't be a problem. However, if you intend to use mDNS, you will need to adjust the filter.
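As an illustration of what "adjusting the filter" for mDNS might involve: IPv4 mDNS uses the multicast group 224.0.0.251, which maps to the destination MAC 01:00:5E:00:00:FB, so a check for that exact address would have to come before the blanket IPv4-multicast rejection. The helper below is a sketch with a hypothetical name; it is not part of lwIP.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 224.0.0.251 maps to 01:00:5E:00:00:FB: the fixed 01:00:5E prefix
 * followed by the low 23 bits of the IPv4 group address. */
static const uint8_t MDNS_IPV4_MAC[6] = {0x01, 0x00, 0x5E, 0x00, 0x00, 0xFB};

/* True if the 6-byte destination MAC is the IPv4 mDNS group address.
 * (Hypothetical helper; the filter function would call it before the
 * general "reject IPv4 multicast" branch.) */
bool is_mdns_ipv4_dest(const uint8_t dest_mac[6])
{
    assert(dest_mac != NULL);
    return memcmp(dest_mac, MDNS_IPV4_MAC, 6) == 0;
}
```

With the filter shown earlier, this would be called as is_mdns_ipv4_dest(pktHdr->dest.addr) and the packet accepted when it returns true. The MAC and the IP stack also have to be set up to receive that multicast group, which is outside the scope of this sketch.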

Quote

I wonder why this is not a much more widespread issue.

Those broadcast packets would trigger a pause which was never less than 1-3 seconds. This didn't look right as being the "first backoff timeout for a missing packet" for TCP, because that would be ridiculous and would really hit performance. I would expect the first backoff to be maybe 100-300ms and this not noticeable in most applications. And the length of the pause meant it was obviously generated within LWIP.

I agree, this seems like something other than simple packet loss. I still think it would be helpful to see a debug log covering the entire pause period, including a few seconds of good communication on both sides of the pause.

Quote

Re interrupt driven RX, I don't want to put in that time right now because I am trying to get onto other stuff (this product is nearly done) but what would help would be a very simple ISR which just sets a "there is ETH RX data" flag. In my polling loop I could then execute that osDelay(2) only if that flag is not set. That would be a reasonable and safe compromise. Not sure how to set this up though. I have done timer ISRs before but never played with the ETH controller. The "HAL" code sets up an ISR which then tests every possible source, which is ridiculous. And the interrupt source needs to be cleared correctly so the ISR doing nothing may not work unless it just clears the IP bit or some such.

Well, in fact, this is exactly how the ST example code works. The interrupt fires and does little more than set a FreeRTOS semaphore. The ethernet input task (ethernetif_input) waits on this semaphore and calls low_level_input() in a loop when it fires. Not the most efficient design (it requires two context switches to deliver a packet from a dead stop). But it's much more responsive than polling.

Have you looked at the code in the ST LwIP example? It seems to have all functionality needed to use interrupts. https://www.st.com/en/embedded-software/stsw-stm32070.html

One question for you: Are you aware of the issue with cache coherency and the DMA controller? There is a lot of discussion about it on the ST board (e.g. https://community.st.com/s/question/0D53W00001Z9K9TSAV/maintaining-cpu-data-cache-coherence-for-dma-buffers). Now maybe the code you're using already has fixes for this. But if it doesn't, then perhaps the problem is that sometimes the cache lines holding the DMA descriptors aren't being flushed to main memory promptly, meaning that the Ethernet DMA controller doesn't see them right away, causing a pause in reception ... just a thought.

Seems like someone needs to put together an LwIP example app for the STM32F4xx with all the known issues fixed. There's one for the STM32H7 here: https://community.st.com/s/question/0D50X0000C6eNNSSQ2/bug-fixes-stm32h7-ethernet. If I had a dev board with me I would take a run at writing this myself.

peter-h · « **Reply #54 on:** January 28, 2023, 08:06:41 pm »

Quote

Your network task would wait on the event or a timeout and the ISR would signal the event causing the wait to unblock and be scheduled according to its priority.

Yes you can do that. I am not sure what the gain is though. If data arrives, you have to process it. So you may as well use an ISR to process it.

Quote

I still think it would be helpful to see a debug log covering the entire pause period, including a few seconds of good communication on both sides of the pause.

I can get you that but it would be a huge amount of text.

Quote

Have you looked at the code in the ST LwIP example? It seems to have all functionality needed to use interrupts. https://www.st.com/en/embedded-software/stsw-stm32070.html

That one is behind their usual "security" where they send you a link and you download it. I just have; it's a huge library. I will take a look at it later. There is a procedure for rolling this up into a Cube project, which I've never done. I've always just written new modules (.c and .h, mainly). But I will take a look.

The trouble is that practically everything from ST doesn't actually work properly. A lot of the code is clearly written by people with little or no embedded experience. It works OK on the demo boards they sell. My "Monday afternoon guy" spent a year or two of Monday afternoons fixing the original ST code. I am glad this is now well behind me and I have been able to get on with real stuff. Well, he has just spent a few months of Mondays on fixing MbedTLS... My problem is that I don't have deep expertise in C, or in much of the other ARM32 stuff, so while I do produce good solid code in the end, it takes me a long time, and I have to pick my battles carefully.

Quote

Are you aware of the issue with cache coherency and the DMA controller?

Yes; this has come up lots of times but each time turned out to be irrelevant to the 32F4. Like the __DMB here

Code: [Select]

	if (bufcount == 1U)
	{
		/* Set LAST and FIRST segment */
		heth->TxDesc->Status |=ETH_DMATXDESC_FS|ETH_DMATXDESC_LS;
		/* Set frame size */
		heth->TxDesc->ControlBufferSize = (FrameLength & ETH_DMATXDESC_TBS1);
		/* Set Own bit of the Tx descriptor Status: gives the buffer back to ETHERNET DMA */
		//__DMB();
		heth->TxDesc->Status |= ETH_DMATXDESC_OWN;
		/* Point to next descriptor */
		heth->TxDesc= (ETH_DMADescTypeDef *)(heth->TxDesc->Buffer2NextDescAddr);
	}

I don't fully understand why. I think the 32F4 doesn't have the data cache. The only people who do understand it are wek and a guy called "Piranha" on the ST forum who likes to tell people how stupid they are.

I think it is all M7 related. Quoting from that ST post:

Quote

So, from the five companies, which have been involved in making MCUs with Cortex-M7, all five have failed with this. And since 2014, when the Cortex-M7 has been announced, ARM also haven't stepped up and helped to fix the situation by giving a clear explanation and examples.

Quote

Seems like someone needs to put together an LwIP example app for the STM32F4xx with all the known issues fixed. There's one for the STM32H7 here: https://community.st.com/s/question/0D50X0000C6eNNSSQ2/bug-fixes-stm32h7-ethernet. If I had a dev board with me I would take a run at writing this myself.

I've offered people bits of money on various occasions.

I use freelancer.com quite a bit for little isolated projects. One really good guy wrote some client-side JS for file upload and download (for an HTTP server which I wrote myself, after a project which was based on ST's HTTPD was aborted due to the subcontractor running way past the budget) and for that it works well. Others wrote a USB VCP loopback tester (which I used extensively on this project; it has four serial ports and a 5th is a USB VCP), a graphical FreeRTOS stack space viewer, etc.

I had a look at the interrupt code and found the ETH IRQ handler in the ST Cube code. There is only about one IRQ for the ETH:

Code: [Select]


/**
 *
  * @brief  This function handles ETH interrupt request.
  * @param  heth pointer to a ETH_HandleTypeDef structure that contains
  *         the configuration information for ETHERNET module
  *
  * This ISR is never called because all the stuff is polled. The vector points
  * to ETH_IRQHandler which currently doesn't exist.
  * In fact since nothing references this function, it gets optimised away.
  *
  */
void HAL_ETH_IRQHandler(ETH_HandleTypeDef *heth)
{
  /* Frame received */
  if (__HAL_ETH_DMA_GET_FLAG(heth, ETH_DMA_FLAG_R)) 
  {
    /* Receive complete callback */
    HAL_ETH_RxCpltCallback(heth);
    
     /* Clear the Eth DMA Rx IT pending bits */
    __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_IT_R);

    /* Set HAL State to Ready */
    heth->State = HAL_ETH_STATE_READY;
    
    /* Process Unlocked */
    __HAL_UNLOCK(heth);

  }
  /* Frame transmitted */
  else if (__HAL_ETH_DMA_GET_FLAG(heth, ETH_DMA_FLAG_T)) 
  {
    /* Transfer complete callback */
    HAL_ETH_TxCpltCallback(heth);
    
    /* Clear the Eth DMA Tx IT pending bits */
    __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_IT_T);

    /* Set HAL State to Ready */
    heth->State = HAL_ETH_STATE_READY;
    
    /* Process Unlocked */
    __HAL_UNLOCK(heth);
  }
  
  /* Clear the interrupt flags */
  __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_IT_NIS);
  
  /* ETH DMA Error */
  if(__HAL_ETH_DMA_GET_FLAG(heth, ETH_DMA_FLAG_AIS))
  {
    /* Ethernet Error callback */
    HAL_ETH_ErrorCallback(heth);

    /* Clear the interrupt flags */
    __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_FLAG_AIS);
  
    /* Set HAL State to Ready */
    heth->State = HAL_ETH_STATE_READY;
    
    /* Process Unlocked */
    __HAL_UNLOCK(heth);
  }
}

and that then calls various other bits. I really don't want interrupt driven TX because AIUI the packets just pass through LWIP (following a socket write call by the application) and end up in low_level_output, which immediately loads them into the ETH buffer(s). I spent some time measuring performance for different ETH TX buffer numbers and found anything above 2 is a waste of RAM (may not be for 4G etc).

Thank you for the time you have spent helping me

Siwastaja · « **Reply #55 on:** January 29, 2023, 07:15:04 am »

Also remember that TCP is supposed to be crappy over slow/unreliable links when one wants any kind of real-time responsiveness. Wrong protocol for the job if you can't accept a few seconds of delay after losing a packet; although you could make it better by tuning.

Everyone who has tried to type commands over an SSH connection with even 1-2% packet loss or 500ms ping knows the pain.

It has got better thanks to decades and decades of tuning, but tuning works only given the right protocol implementations and certain types of payloads. For example, I think SSH and HTTP work better now than 15-20 years ago; back then losing a few packets possibly meant waiting so many seconds that hitting F5 manually in a browser was faster than waiting for TCP to recover!

If you need quick response time under packet loss and/or a small memory footprint, you are better off designing your own simple application-specific protocol, usually on top of UDP. This is what games do too.

peter-h · « **Reply #56 on:** January 29, 2023, 10:24:14 am »

Yes; this is why initially I thought these long gaps were normal. And they rarely occur with short packets, presumably because short packets pass by faster. In the vast majority of applications they would not be detected by the customer.

But they should not be normal; for a start they are quite noticeable in any human user scenario. Or if running some serial to ethernet to serial application, you will need to set rather long timeouts to avoid errors. Note that data was not usually being lost. It was just being delivered with long gaps so e.g. a 1000 byte test packet would be delivered as 300 300 200 200 with gaps in between.

Also AFAICT the timeout in my LWIP config is 250ms, not the "3 seconds" which tends to turn up on google when searching for a lost packet timeout. Also 3 secs would be quite ridiculous.

I still think it is suspicious because the gaps are not even. But it is easy to reproduce. Just set buffers=2, recompile, and run, and right away you see it. It gets better after a minute or so, which makes sense as the frequency of broadcast packets reduces after an initial busy period when a new box is connected to the LAN.

I am now back to running with the "busy" TP-Link QOS switch again, with all the other stuff on it including a VPN to another site, since it is a better test environment. That same switch produced a lot of trouble in the past:
https://www.eevblog.com/forum/networking/bizzare-issue-with-a-linksys-lgs116p-qos-switch-fake-mac-related/msg3622359/#msg3622359

My box is using a real MAC # this time.

On a more bizarre topic: I've increased the ETH TX buffers from 2 to 4 and it looks like I will have another interesting data point. But I won't know for sure until the evening. Surely this has to be pretty equivalent to giving LWIP more TX buffers. But I can't see any lwipopts.h setting for TX buffering (did many hours of googling in the past), hence I think there isn't one, and LWIP just takes the buffer you pass to it when writing a socket and sends it either direct (if < MTU) or in multiple chunks. It has to prefix and postfix each chunk on the way to the PHY layer, but AFAICT there is no "TX buffering" as such. Never found anybody who knew about this... but that is how I would do it. No point in buffering for fun, and this method also makes it possible to do "zero copy TX" which I think LWIP actually does. That also explains why a socket write blocks, even on a non-blocking socket, until the next RTOS tick.

Zero copy RX is a whole different animal, and has been done, but only with oblique references on forums and little if any working code.

EDIT: the TX buffer increase benefit is hard to reproduce. I am not surprised because, during extensive previous tests, the data disappeared down the wire so fast that above 2 made no difference at all.

peter-h · « **Reply #57 on:** January 30, 2023, 01:27:43 pm »

Had a quick look at the history of why this project doesn't use ETH RX interrupts.

It was found that with the reduced (RMII) PHY interface, the LAN8742's IRQ output is shared with an LED driver and thus cannot be used to generate interrupts if you want the LED function. This may be nonsense because it does appear that the 32F4 can generate an IRQ wholly internally, regardless of whether the PHY is MII or RMII. But this was never tested. It may have been something to do with the ST development board originally used.

If that IRQ works, it points to here

Code: [Select]


void HAL_ETH_IRQHandler(ETH_HandleTypeDef *heth)
{
  /* Frame received */
  if (__HAL_ETH_DMA_GET_FLAG(heth, ETH_DMA_FLAG_R)) 
  {
    /* Receive complete callback */
    HAL_ETH_RxCpltCallback(heth);
    
     /* Clear the Eth DMA Rx IT pending bits */
    __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_IT_R);

  }
  /* Frame transmitted */
  else if (__HAL_ETH_DMA_GET_FLAG(heth, ETH_DMA_FLAG_T)) 
  {
    /* Transfer complete callback */
    HAL_ETH_TxCpltCallback(heth);
    
    /* Clear the Eth DMA Tx IT pending bits */
    __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_IT_T);

  }
  
  /* Clear the interrupt flags */
  __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_IT_NIS);
  
  /* ETH DMA Error */
  if(__HAL_ETH_DMA_GET_FLAG(heth, ETH_DMA_FLAG_AIS))
  {
    /* Ethernet Error callback */
    HAL_ETH_ErrorCallback(heth);

    /* Clear the interrupt flags */
    __HAL_ETH_DMA_CLEAR_IT(heth, ETH_DMA_FLAG_AIS);
  
   }
}

(I've removed the superfluous code)

Looking at the interrupt config in the RM, there is a vast number of interrupts which can be enabled, and a vast range of IP bits which need to be cleared accordingly. So unless somebody really understands the hardware, there is a lot of code that needs to be done right.

Daixiwen · « **Reply #58 on:** January 31, 2023, 08:36:18 am »

I'm not sure I understood what you wrote correctly, but the interrupt from the LAN8742 is only used for events from the PHY itself. Those are usually status changes, such as autonegotiation result and cable connect/disconnect. You can see the interrupt sources in the datasheet here on page 87. The Ethernet receive interrupt is completely different and comes from the MAC or the DMA, and as you describe is coming from the 32F4 itself. It should have nothing to do with the PHY IRQ line.

peter-h · « **Reply #59 on:** January 31, 2023, 08:48:23 am »

Thanks - that makes sense. So it was a misunderstanding right at the start. Currently we poll the status, by calling this about once a second

Code: [Select]



// Called from ETHIF thread. Detects link change and resets the link state.

static void ethernetif_set_link(struct netif *netif)
{
	uint32_t regvalue = 0;

	// Read PHY_BSR
	HAL_ETH_ReadPHYRegister(&EthHandle, PHY_BSR, &regvalue);

	regvalue &= PHY_LINKED_STATUS;

	// Check whether the netif link down and the PHY link is up
	if (!netif_is_link_up(netif) && (regvalue))
	{
		/* network cable is connected */
		netif_set_link_up(netif);
	}
	else if (netif_is_link_up(netif) && (!regvalue))
	{
		/* network cable is disconnected */
		netif_set_link_down(netif);
	}

}

IOW, this reads the LAN8742 status reg and sees if it has changed.

This is all done in low_level_input which as I mention further back is polled every 10ms, reducing to 2ms if there is rx data, reverting to 10ms after 10ms of no data.

Daixiwen · « **Reply #60 on:** January 31, 2023, 09:06:49 am »

Yes indeed once a second is more than enough. How quickly do you want your application to react to a cable connect / disconnect?

I am not familiar with the 32F4 (or any STM32 in fact), but there should be a "new packet" interrupt from the MAC or the RX DMA that you could use instead of polling to get better performance. From your code snippet it could be ETH_DMA_IT_R, but it would need some testing to see how it needs to be set up.

peter-h · « **Reply #61 on:** January 31, 2023, 09:55:38 am »

Indeed; as discussed further back, interrupt RX would deliver a higher performance. I just don't want to do it right now because I don't think there is any proven working code available (for the 32F4; various code snippets have appeared for others) and the ST Cube MX demo code (for their demo board(s)) worked but was not reliable.

There will always be the issue that with a broadcast storm at 100Mbps you get interrupts at - depending on actual packet size - up to roughly 150kHz (one minimum-size frame every ~6.7us), but the time to extract a packet from the ETH PHY buffers has been measured at way more than that. Genuine packets take ~30us to process (from ETH PHY buffers to the back end of LWIP, using memcpy in 32 bit chunks; DMA would not help especially on small packets) and non-ARP broadcasts, while discarded (unless this is configured OFF, which it may be if you want to do "UPNP" i.e. product discovery) take about 10us. Even the simplest code to service an interrupt and discard the packet based on its header will be ~10us including the ISR overhead. So it would be quite easy to totally kill (DOS) this product on a bad LAN, unless more complication is introduced.
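A back-of-envelope check of the worst-case frame rate can be done from the Ethernet framing alone. The sketch below assumes untagged frames and counts the 7-byte preamble, 1-byte start-of-frame delimiter and 12-byte inter-frame gap as per-frame wire overhead; the function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes on the wire per frame beyond the frame itself:
 * 7 preamble + 1 start-of-frame delimiter + 12 inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Time one frame occupies the wire, in nanoseconds.
 * frame_bytes includes the Ethernet header and FCS
 * (64..1518 for untagged frames). */
uint64_t frame_wire_time_ns(uint32_t frame_bytes, uint32_t link_mbps)
{
    assert(link_mbps > 0);
    uint64_t bits = (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
    /* bits / (Mbit/s) gives microseconds; scale by 1000 for nanoseconds. */
    return bits * 1000u / link_mbps;
}

/* Maximum back-to-back frames per second at the given size and rate. */
uint32_t max_frames_per_second(uint32_t frame_bytes, uint32_t link_mbps)
{
    return (uint32_t)(1000000000ull / frame_wire_time_ns(frame_bytes, link_mbps));
}
```

For minimum-size 64-byte frames at 100 Mbps this gives 6720 ns per frame, or about 148,800 frames per second, which is still faster than a ~10us per-packet discard cost. At 10 Mbps the same frame takes 67.2us, which is one reason forcing the link rate down was suggested earlier.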

Link change detect at say 1Hz is easily ok.

Daixiwen · « **Reply #62 on:** January 31, 2023, 11:58:24 am »

Yes I can understand the problem.
Another easier optimization that I have used, since the driver seems to be using several receive buffers, is to make sure that the DMA already reads the next packet while you are processing the current packet. Then you have the possibility to poll the rx status right after you are finished processing one packet, and under heavy traffic it's probable you would already have a new packet waiting for you. The main problem with this solution, a bit like what you mentioned with the interrupts, is that you may make the application vulnerable to DOS by sending a lot of packets, as the receive task usually has a higher priority than the rest of the application. You could have a counter so that the receive task yields control back to the OS after n packets and find the right balance between performance and vulnerability.

peter-h · « **Reply #63 on:** January 31, 2023, 12:26:13 pm »

Yes; I have tested this too. In the polled model, it is easy. This is the entire RTOS task for ETH RX:

Code: [Select]

void ethernetif_input( void * argument )
{

	struct pbuf *p;
	struct netif *netif = (struct netif *) argument;
	uint32_t link_change_check_count = ETH_LINK_CHANGE_COUNT;

	// Define RX activity timer, for dropping fast poll down to slow poll
	// xTimerCreate() returns a TimerHandle_t (already a handle), not a pointer to one
	TimerHandle_t rxactive_timer = xTimerCreate("ETH RX active timer", pdMS_TO_TICKS(ETH_SLOW_POLL_DELAY), pdFALSE, NULL, RXactiveTimerCallback);

	// Start "rx active" timer
	xTimerStart(rxactive_timer, 20);	// 20 is just a wait time for timer allocation

	do
    {

		p = low_level_input( netif );	// This sets rxactive=true if it sees data

		if (p!=NULL)
		{
			if (netif->input( p, netif) != ERR_OK )
			{
				pbuf_free(p);
			}
		}

		if (rxactive)
		{
			rxactive=false;
			// Seen rx data - reload timeout
			xTimerReset(rxactive_timer, 20);	// Reload "rx active" timeout (with ETH_SLOW_POLL_DELAY)
			// and get osDelay below to run fast
			rx_poll_period=ETH_RX_FAST_POLL_INTERVAL;
		}

		// This has a dramatic effect on ETH speed, both ways (TCP/IP acks all packets)
		osDelay(rx_poll_period); // 2ms or 10ms

		// Do ETH link status change check

		link_change_check_count--;
		if (link_change_check_count==0)
		{
			// reload counter
			link_change_check_count = ETH_LINK_CHANGE_COUNT;

			// Get most recently recorded link status
			bool net_up = netif_is_link_up(&g_netconf_netif);

			// Read the physical link status
			ethernetif_set_link(&g_netconf_netif);

			// Has the link status changed
			if (net_up != netif_is_link_up(&g_netconf_netif))
			{
				ethernetif_update_config(&g_netconf_netif);

				if (net_up) {
				   // Link was up so must have dropped
				   debug_thread_printf("Ethernet link down");
				}
				else {
				   // Link was down so must be up - restart DHCP
				   debug_thread("Ethernet link up");
				   network_restart_DHCP();
				}
			}
		}

    } while(true);

}

Half of it is the link change detect stuff, so forget that.

It is easy to rig this so that if p!=NULL here

Code: [Select]


		if (p!=NULL)
		{
			if (netif->input( p, netif) != ERR_OK )
			{
				pbuf_free(p);
			}
		}

the call below to osDelay() is skipped, so you go right back to testing if there is another packet (and probably skip the link change stuff as well).

But that takes you right back to being easily DOSable. So what can one do? As you suggest, a counter for the max # of consecutive packets thus retrieved? That will give good performance on a burst of packets. But only if the burst is "just right".
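The counter idea could be sketched as below. As far as the posted loop shows, it fetches one frame per iteration and then calls osDelay(), so sustained receive throughput is capped at roughly one frame per poll period; draining a bounded burst before sleeping lifts that cap while still limiting how long the task can hog the CPU. Everything here is hypothetical naming: the fetch and deliver hooks stand in for low_level_input() and netif->input(), and the demo_* helpers exist only to exercise the loop on a host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hooks. In the driver discussed here they would wrap
 * low_level_input() (returns NULL when nothing is waiting) and
 * netif->input() plus the pbuf_free() on error. */
typedef void *(*rx_fetch_fn)(void *ctx);
typedef void (*rx_deliver_fn)(void *ctx, void *frame);

/* Handle up to max_burst frames back to back, then return so the caller
 * can yield to the RTOS. Returns the number of frames handled:
 *   0           -> idle, use the slow poll delay
 *   < max_burst -> queue drained, use the fast poll delay
 *   == max_burst-> more may be waiting, yield briefly and come back */
unsigned drain_rx_burst(rx_fetch_fn fetch, rx_deliver_fn deliver,
                        void *ctx, unsigned max_burst)
{
    assert(fetch != NULL && deliver != NULL && max_burst > 0);
    unsigned handled = 0;
    while (handled < max_burst) {
        void *frame = fetch(ctx);
        if (frame == NULL) {
            break;              /* receive queue is empty */
        }
        deliver(ctx, frame);
        handled++;
    }
    return handled;
}

/* Host-side stand-ins so the loop can be tried without hardware. */
struct demo_rx { unsigned waiting; unsigned delivered; };

void *demo_fetch(void *ctx)
{
    struct demo_rx *d = ctx;
    if (d->waiting == 0) {
        return NULL;
    }
    d->waiting--;
    return d;                   /* any non-NULL value stands for a frame */
}

void demo_deliver(void *ctx, void *frame)
{
    (void)frame;
    ((struct demo_rx *)ctx)->delivered++;
}
```

With 5 frames queued and max_burst set to 3, the first call handles 3 frames and the second handles the remaining 2. The right value for max_burst depends on how long the other RTOS tasks can tolerate being starved, which only testing on the target can settle.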

Much depends on what other code you are running, because obviously all the time you are in this loop you won't be yielding back to the RTOS (with osDelay(1) being the simplest example) so at 1kHz Systick nothing else will run for 1ms, except ISRs.

You could spend your whole life optimising some scenario

but because this CPU doesn't have a hardware PHY-layer packet filter, it will never work properly if you get a packet flood. The 32F4 can dump broadcast packets but that's OK only (AFAICT) for a box which is addressed by the MAC #.

nctnico · « **Reply #64 on:** January 31, 2023, 12:31:41 pm »

Yep. On a constrained platform you have to just drop ethernet packets at some point. Every protocol that runs on ethernet is designed to deal with packet loss so you won't lose any functionality. Only realtime performance will suffer. In some of the ethernet based designs I have made, the MAC interface is bitbanged using GPIO pins (with a switch chip as PHY that can also do buffering).

peter-h · « **Reply #65 on:** January 31, 2023, 01:21:50 pm »

Quote

Every protocol that runs on ethernet is designed to deal with packet loss so you won't lose any functionality.

Except LWIP which seemingly can't handle a broadcast packet at the same time as some other packet... Perhaps this issue occurs only if

- you have just 2 buffers
- 1 gets a broadcast
- 1 gets some other
- and then there is a gap

Otherwise, in my test setup, with broadcasts at ~1Hz, this would be happening at the broadcast frequency, but actually it is very rare.

Or maybe it is a bug in some of the low level code, which disappears when you have 3+ buffers.

Quote

the MAC interface is bitbanged using GPIO pins

Isn't that very slow?

nctnico · « **Reply #66 on:** January 31, 2023, 01:43:06 pm »

Quote from: peter-h on January 31, 2023, 01:21:50 pm

Quote
the MAC interface is bitbanged using GPIO pins

Isn't that very slow?

Faster compared to a 14k4 modem

peter-h · « **Reply #67 on:** January 31, 2023, 03:27:43 pm »

Sure, TCP/IP delivers end to end error correction and flow control, which makes this 2-buffer+broadcast thing even more of a mystery.

At the PHY level, a broadcast is handled like any other packet, AIUI. Only inside LWIP is there any difference. Is that right?

Several days of testing now, with no errors.

ejeffrey · « **Reply #68 on:** January 31, 2023, 08:50:21 pm »

This is another area where a packet sniffer would be extremely helpful. You could see exactly what broadcast (or other!) packets are being sent close in time to the data that is being dropped. It might be that some part of the network is, e.g., triggering an ARP request synchronous with your data packets, causing multiple sequential overflows even on a relatively low utilization link.

TCP interprets packet loss as congestion, and uses an exponential backoff algorithm to avoid making the congestion worse by sending lots of retries. That assumption breaks down when the packet drops are not really due to congestion. This comes up with wireless communication where interference or signal fade can cause a burst of packet drops, which can cause all the TCP sessions to back off. Once the link comes back, it may be immediately available at 100% capacity, but it takes a while for all the stalled TCP connections to ramp back up.
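To put rough numbers on that backoff, here is a minimal sketch, assuming the textbook behaviour where the retransmission timeout doubles after every consecutive timeout, up to a cap. The function names, the 1 s starting value and the 60 s cap are illustrative assumptions, not values read out of LWIP or measured in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Retransmission timeout in force after `retries` consecutive timeouts.
 * The timeout doubles each time and is clamped to max_ms.
 * initial_ms and max_ms are illustrative inputs, not LWIP settings. */
static uint32_t rto_after_retries(uint32_t initial_ms, unsigned retries,
                                  uint32_t max_ms)
{
  uint32_t rto = initial_ms;

  assert(initial_ms > 0 && initial_ms <= max_ms);

  while (retries-- > 0 && rto < max_ms) {
    /* Clamp before doubling so the multiplication cannot overflow. */
    rto = (rto > max_ms / 2) ? max_ms : rto * 2;
  }
  return rto;
}

/* Total time spent waiting before the n-th retransmission goes out. */
static uint32_t total_wait_ms(uint32_t initial_ms, unsigned n, uint32_t max_ms)
{
  uint32_t total = 0;

  for (unsigned i = 0; i < n; i++) {
    total += rto_after_retries(initial_ms, i, max_ms);
  }
  return total;
}
```

With a 1000 ms start, `rto_after_retries(1000, 3, 60000)` is 8000 ms and `total_wait_ms(1000, 3, 60000)` is 7000 ms, so three drops in a row on one segment already stall that connection for several seconds even though the link itself is idle.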

2 packet buffers give you only a small margin for error, and your 10 ms polling interval could also be causing problems. In some sense it's not surprising that you are having trouble, but it's also hard to say without seeing exactly what is happening.

peter-h · « **Reply #69 on:** January 31, 2023, 10:27:15 pm »

FWIW, I tried a near-zero polling interval and it did the same thing.

Also I am running a test with 2 sessions, with randomly timed packets, and for sure there would be 2 packets arriving "concurrently" and grabbing those two buffers, quite often.

Yet the problem surfaced only with broadcasts present.

The problem is that LWIP is immensely complicated. The only thing more horrible is MbedTLS, although to be fair 90% of that is crypto suites which they pulled in from all over the net. Half of the config in lwipopts.h (and opt.h) is a mystery, evidently to a zillion devs desperately posting all over the www over the past 15+ years. I guess if you know tcp/ip from the fundamentals (like a 1970s Computer Science academic) then you will recognise most of it because your brain is already "wired right". FWIW I recompiled the project with every single LWIP option doubled up, all tested separately, and nothing made any difference. This thing is something deep down. May have been some task deadlock between LWIP and the RTOS, and if it was, it would never be fixed; not by somebody on my pay grade.

Anyway it is fixed now

Daixiwen · « **Reply #70 on:** February 01, 2023, 09:45:21 am »

Yes it's hard to follow the code in LWIP. I try to avoid it as much as I can. But this behavior looks like a bug.
Maybe you could implement your own filter in the eth rx function, that would automatically drop any broadcast packet that isn't an ARP request, before giving them to LWIP? It feels a bit like fixing a problem with rope and gaffer tape, but sometimes you just need to move forward with the project...

peter-h · « **Reply #71 on:** February 01, 2023, 10:10:08 am »

Yes that was done - see posts by dare further back.

I have this in there, user configurable via a config file entry.

That filter makes operation with eth rx buffers=2 much more reliable, but not completely: even ARP broadcasts cause the problem (though far less often).

So it isn't a CPU issue and in any case I don't think the CPU treats broadcasts differently (if that feature to dump them is not used).

Probably some deadlock issue with lwip+rtos. For example, when playing around, I found that you can read data from a socket with rtos task switching disabled, but if you try a socket write, it hangs. So a non blocking socket does actually block until the next rtos tick. I can see why; lwip needs to package the data and send it out, which it can't do unless there is room lower down, and any kind of state machine runs on ticks.

Daixiwen · « **Reply #72 on:** February 02, 2023, 08:54:38 am »

Sorry about that. I have to admit I only read the beginning of the thread a while back and the messages from last week. I just had a glance over the rest and missed it.
I have more experience with the BSD IP stack and the eCos operating system, but could your problem be related to priority inversion? We had the problem once on an application because the Ethernet RX thread was running on the highest priority, the rest of the IP stack on a high priority and the application on a lower priority. Without priority inheritance we had all kinds of weird problems when the network traffic got higher, especially when trying to send and receive at the same time. Ideally the RX and TX paths should use mutually exclusive synchronization mechanisms but it's not possible with every kind of hardware out there.
If you have at least 3 threads on different priorities all using the same mutex/semaphore for synchronization it may be something to look into:
https://www.digikey.no/en/maker/projects/introduction-to-rtos-solution-to-part-11-priority-inversion/abf4b8f7cd4a4c70bece35678d178321

peter-h · « **Reply #73 on:** February 02, 2023, 09:54:11 am »

Yes; could be.

The lowest level ETH RX (LOW_LEVEL_INPUT) runs at osPriorityRealtime = 48. This stuffs the eth rx packets into LWIP. No mutexes are involved with this thread.

LWIP runs at osPriorityHigh = 40.

Applications are supposed to run at a lower priority than osPriorityHigh and mine is running at tskIDLE_PRIORITY = 0.

The question is whether a lower priority task can set a mutex which blocks a higher priority task. The only mutex I can think of which might be active here is the LWIP one, set up in lwipopts.h:

Code: [Select]

// Flag to make LWIP API thread-safe. The netconn and socket APIs are claimed
// to be thread-safe anyway. The raw API is never thread-safe.
#define LWIP_TCPIP_CORE_LOCKING   	1

This is poorly documented but tracing the code down, it appears to set up a mutex around the API calls. This implements a coarser exclusion than what LWIP is claimed to do without this mode set, but is claimed to be more efficient because LWIP otherwise implements thread-safety by using a messaging system internally, which takes up more processing. That's my primitive understanding; I never found anyone who could give a straight answer (despite vast numbers of questions on the www, and on the nearly-dead LWIP mailing list). Anyway, using this option or not makes no difference to the issues I had.

But there is only one mutex for LWIP and I can see that if you lock a socket on e.g. a read then another thread doing any socket operation will get blocked. This will obviously be a problem if someone is using a blocking read socket; these hang until data arrives. For writes, this should not hang anything because AIUI all socket writes end up with data disappearing down the wire anyway.

This description is curious - here: https://www.nongnu.org/lwip/2_0_x/group__lwip__opts__lock.html
LWIP_TCPIP_CORE_LOCKING Creates a global mutex that is held during TCPIP thread operations. Can be locked by client code to perform lwIP operations without changing into TCPIP thread using callbacks. See LOCK_TCPIP_CORE() and UNLOCK_TCPIP_CORE(). Your system should provide mutexes supporting priority inversion to use this.

Does anyone know what "without changing into TCPIP thread using callbacks" means?
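One plausible reading of that phrase, sketched as a host-side model rather than real LWIP code: with core locking off, an application thread hands its request to the TCPIP thread as a message and the work runs in that thread's context (in LWIP itself, tcpip_callback() is, as far as I know, the call for doing that by hand); with core locking on, the application thread takes the global lock and runs the core code itself, with no hand-over. Every name in the block below is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model only: none of these names are LWIP APIs. */
typedef enum { CTX_APP, CTX_TCPIP } context_t;
typedef void (*core_fn_t)(void *arg);

static context_t current_ctx = CTX_APP; /* which "thread" is executing now */
static int core_locked = 0;             /* stands in for the global core mutex */

#define QUEUE_LEN 4
static struct { core_fn_t fn; void *arg; } queue[QUEUE_LEN];
static size_t queue_count = 0;

/* Style 1 (message passing): queue a callback for the TCPIP thread. */
static int post_to_tcpip_thread(core_fn_t fn, void *arg)
{
  if (queue_count == QUEUE_LEN) {
    return -1; /* mailbox full */
  }
  queue[queue_count].fn = fn;
  queue[queue_count].arg = arg;
  queue_count++;
  return 0;
}

/* One pass of the TCPIP thread: run queued callbacks in its own context,
 * holding the core lock while it does so. */
static void tcpip_thread_poll(void)
{
  context_t saved = current_ctx;

  assert(!core_locked); /* a real mutex would block here instead */
  core_locked = 1;
  current_ctx = CTX_TCPIP;
  for (size_t i = 0; i < queue_count; i++) {
    queue[i].fn(queue[i].arg);
  }
  queue_count = 0;
  current_ctx = saved;
  core_locked = 0;
}

/* Style 2 (core locking): take the lock and call directly from the caller. */
static void call_with_core_lock(core_fn_t fn, void *arg)
{
  assert(!core_locked); /* a real mutex would block here instead */
  core_locked = 1;
  fn(arg);
  core_locked = 0;
}

/* Example "core operation": records the context it was executed in. */
static void record_context(void *arg)
{
  *(context_t *)arg = current_ctx;
}
```

Posting `record_context` and then running `tcpip_thread_poll()` records CTX_TCPIP, while `call_with_core_lock(record_context, ...)` records CTX_APP. The second style saves the hand-over to another thread, at the price of every caller contending for one lock.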

I will continue testing with LWIP_TCPIP_CORE_LOCKING = 0 and see if anything breaks

Currently I have several RTOS tasks all using either the Netconn API (an HTTP server I wrote) or the socket API, all running fine.

peter-h · « **Reply #74 on:** February 04, 2023, 09:32:02 am »

Just an update: further input on LWIP_TCPIP_CORE_LOCKING suggests 0 (disabled) should not have any advantages in any circumstances compared with 1 (enabled).

I thought that testing showed better reliability under heavy load (multiple RTOS tasks all using netconn and sockets) with 0 but can't really replicate it, and it could be mixed up with PBUF buffer allocation dependencies.

1 uses a mutex at a high level - at the API level - so will lock out TCP tasks at that high level. This may not matter, but could result in "priority inversion" in that a low priority task using TCP will block a higher priority one, for the entire duration of the LWIP API call. I now possibly understand the bit here https://www.nongnu.org/lwip/2_0_x/group__lwip__opts__lock.html
Your system should provide mutexes supporting priority inversion to use this.
I have no idea what such a mutex would be (a mutex is a mutex, and has to be used with care, no?) but whether this matters depends. It doesn't make sense for a task using TCP comms to have a very high priority anyway
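The docs wording most likely means mutexes with priority inheritance: while a low priority task holds the mutex and a high priority task is blocked on it, the holder temporarily runs at the waiter's priority, so a medium priority task cannot starve it. FreeRTOS mutexes created with xSemaphoreCreateMutex() do this; binary semaphores do not. Below is a small host-side model of the scheduling decision, not RTOS code. The task names and the middle priority of 24 are made up for illustration; 0 and 40 are the priorities quoted earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
  const char *name;
  int base_prio;    /* higher number = higher priority */
  bool ready;       /* false while blocked waiting for the mutex */
  bool holds_mutex;
  bool waits_mutex;
} task_t;

/* With inheritance, the mutex holder is boosted to the highest priority
 * among the tasks currently blocked on that mutex. */
static int effective_prio(const task_t *tasks, size_t n, size_t i, bool inherit)
{
  int prio = tasks[i].base_prio;

  if (inherit && tasks[i].holds_mutex) {
    for (size_t j = 0; j < n; j++) {
      if (tasks[j].waits_mutex && tasks[j].base_prio > prio) {
        prio = tasks[j].base_prio;
      }
    }
  }
  return prio;
}

/* Index of the task a fixed-priority preemptive scheduler would run next. */
static size_t pick_next(const task_t *tasks, size_t n, bool inherit)
{
  size_t best = n;

  for (size_t i = 0; i < n; i++) {
    if (!tasks[i].ready) {
      continue;
    }
    if (best == n || effective_prio(tasks, n, i, inherit) >
                         effective_prio(tasks, n, best, inherit)) {
      best = i;
    }
  }
  assert(best < n); /* at least one task must be ready */
  return best;
}
```

Take three tasks: "app" (priority 0, holding the mutex), "worker" (24, ready and CPU-bound) and "tcpip" (40, blocked on the mutex). Without inheritance `pick_next()` chooses "worker", so "tcpip" waits for as long as "worker" cares to run; that is the inversion. With inheritance it chooses "app", which can finish and release the mutex.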

And for sure the effect will depend on how much buffering you have allocated to the ETH PHY TX; I don't think LWIP has any TX buffering configured anywhere. Helping this will be that most if not all packets disappear down the wire at 10mbytes/sec regardless, especially if Nagle is disabled which seems a good idea in embedded apps anyway.

Dare's multicast filtering function works wonderfully

The only possible issue is that if one day I want to implement discovery, then processing broadcasts will be necessary. Then I guess the function would be modified to accept broadcast packets with a specific payload.
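A minimal sketch of what such a check might look like, assuming discovery requests arrive as broadcast IPv4/UDP datagrams on a known port with a fixed marker at the start of the payload. The port number, the marker and the function name are hypothetical. The block parses the raw frame itself instead of using LWIP's header structs so that it stands alone, and it ignores VLAN-tagged frames.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN     14u
#define ETHERTYPE_IPV4  0x0800u
#define IP_PROTO_UDP    17u
#define UDP_HDR_LEN     8u

/* Hypothetical values, not from the thread: pick your own. */
#define DISCOVERY_PORT  50000u
static const uint8_t DISCOVERY_MAGIC[4] = { 'D', 'I', 'S', 'C' };

/* Read a big-endian 16-bit field. */
static uint16_t rd16(const uint8_t *p)
{
  return (uint16_t)((p[0] << 8) | p[1]);
}

/* True for a broadcast frame carrying IPv4/UDP to DISCOVERY_PORT whose
 * payload starts with DISCOVERY_MAGIC. */
static bool is_discovery_broadcast(const uint8_t *pkt, size_t len)
{
  static const uint8_t bcast[6] = { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF };

  assert(pkt != NULL);

  if (len < ETH_HDR_LEN + 20u) return false;         /* no room for IPv4 header */
  if (memcmp(pkt, bcast, sizeof bcast) != 0) return false;
  if (rd16(pkt + 12) != ETHERTYPE_IPV4) return false;

  const uint8_t *ip = pkt + ETH_HDR_LEN;
  size_t ihl = (size_t)(ip[0] & 0x0Fu) * 4u;         /* IPv4 header length */
  if ((ip[0] >> 4) != 4u || ihl < 20u) return false;
  if (ip[9] != IP_PROTO_UDP) return false;
  /* Later fragments carry no UDP header, so only offset 0 is considered. */
  if ((rd16(ip + 6) & 0x1FFFu) != 0u) return false;
  if (len < ETH_HDR_LEN + ihl + UDP_HDR_LEN + sizeof DISCOVERY_MAGIC) {
    return false;
  }

  const uint8_t *udp = ip + ihl;
  if (rd16(udp + 2) != DISCOVERY_PORT) return false; /* destination port */
  return memcmp(udp + UDP_HDR_LEN, DISCOVERY_MAGIC,
                sizeof DISCOVERY_MAGIC) == 0;
}
```

In the filter posted earlier in the thread, the broadcast branch would then accept a frame if it is ARP or if a check like this returns true, and keep rejecting every other broadcast.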

Later edit in case somebody finds this in the future:

The question of whether LWIP is thread safe is nontrivial. Apparently it is even without LWIP_TCPIP_CORE_LOCKING=1, for the netconn and socket APIs but it is never thread-safe for the raw API. Anyway, originally the extra locking has been implemented as it is claimed to improve performance. More info:
https://www.nongnu.org/lwip/2_1_x/multithreading.html
https://www.nongnu.org/lwip/2_1_x/pitfalls.html

The 2nd URL above says the RAW API is also thread-safe if core locking=1.

While the consensus is that LWIP_TCPIP_CORE_LOCKING=1 should be the default usage, it was experimentally found that the system is much more reliable with =0, so these two have been set to 1 as per the above docs.
#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1
#define SYS_LIGHTWEIGHT_PROT 1
This is necessary especially as, in the commonly used ex-ST low_level_input and low_level_output code, other RTOS tasks are calling e.g. pbuf_free() in ethernetif.c.

Why the system is much more reliable with LWIP_TCPIP_CORE_LOCKING=0 I don't know. The difference is found only when I have around 5 RTOS tasks running, some HTTP, some HTTPS, and other stuff, some using netconn API and some using the socket API.

