Author Topic: LWIP may get stuck in the select function  (Read 2918 times)

Offline tilblackout (Topic starter)

  • Contributor
  • Posts: 31
  • Country: cn
LWIP may get stuck in the select function
« on: September 18, 2023, 09:09:39 am »
The system is using FreeRTOS and LwIP, and there is a sporadic issue where the program gets stuck in the select function.

My program uses two network interfaces: one uses PPP dial-up to access the external network, and the other uses Ethernet. I have two tasks that call select: one monitors the read fds of all sockets, and the other, which writes data to Ethernet, monitors the write fd of that socket. If the select call fails before a write, I take it to mean that LwIP may be running out of memory, and I skip the write.
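The write-side check described above could look roughly like the sketch below. It is written against the POSIX socket API so that it compiles on a desktop system; on LwIP the same calls would normally come from lwip/sockets.h (where select maps to lwip_select when LWIP_COMPAT_SOCKETS is enabled). The helper names wait_writable and send_if_writable are hypothetical and are not taken from the original program.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if fd becomes writable within timeout_ms,
 * 0 on timeout, and -1 if select() itself failed. */
static int wait_writable(int fd, long timeout_ms)
{
    fd_set wfds;
    struct timeval tv;

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    int n = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (n < 0) {
        return -1;
    }
    return (n > 0 && FD_ISSET(fd, &wfds)) ? 1 : 0;
}

/* Hypothetical helper: send only when the socket reports writable.
 * Returns the result of send(), or -1 if the write was skipped. */
static ssize_t send_if_writable(int fd, const void *buf, size_t len,
                                long timeout_ms)
{
    if (wait_writable(fd, timeout_ms) != 1) {
        return -1; /* timeout or select() failure: skip this write */
    }
    return send(fd, buf, len, 0);
}
```

Note that a timeout (return value 0) and a select failure (return value -1) are different conditions; keeping them apart in the log may help when chasing a hang like this one.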

After running the program overnight, it typically gets stuck after seven to eight hours, with both tasks stuck inside the select function. How can I pinpoint the issue, as it's difficult to reproduce and appears to be related to the non-reentrant nature of select?

I've noticed two macro definitions: LWIP_NETCONN_SEM_PER_THREAD and LWIP_NETCONN_FULLDUPLEX. Would enabling these macros help avoid this situation?

I have also observed that when a socket being monitored by select is closed by another task before the select call reaches its timeout, the socket cannot be properly reclaimed. Is this also related to these two macro definitions?

I would greatly appreciate any assistance you can provide. ;D
 

Online nctnico

  • Super Contributor
  • ***
  • Posts: 26907
  • Country: nl
    • NCT Developments
Re: LWIP may get stuck in the select function
« Reply #1 on: September 18, 2023, 12:01:03 pm »
You need some kind of locking. You have to check the LWIP source code to see which locking mechanism select() uses and enable that. IIRC Lwip has two thread locking options but I don't know which does what.
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 
The following users thanked this post: tilblackout

Offline ledtester

  • Super Contributor
  • ***
  • Posts: 3036
  • Country: us
Re: LWIP may get stuck in the select function
« Reply #2 on: September 18, 2023, 04:59:36 pm »

Quote
I have also observed that when a socket being monitored by select is closed by another task before the select call reaches its timeout, the socket cannot be properly reclaimed. Is this also related to these two macro definitions?


Are you calling raw API functions from the other task? I found this at:

https://www.nongnu.org/lwip/2_1_x/multithreading.html

Quote
lwIP started targeting single-threaded environments. When adding multi-threading support, instead of making the core thread-safe, another approach was chosen: there is one main thread running the lwIP core (also known as the "tcpip_thread"). When running in a multithreaded environment, raw API functions MUST only be called from the core thread since raw API functions are not protected from concurrent access (aside from pbuf- and memory management functions). Application threads using the sequential- or socket API communicate with this main thread through message passing.
...

There's a lot more multithreading advice on that page.
« Last Edit: September 18, 2023, 05:16:30 pm by ledtester »
 
The following users thanked this post: tilblackout

Offline tilblackout (Topic starter)

  • Contributor
  • Posts: 31
  • Country: cn
Re: LWIP may get stuck in the select function
« Reply #3 on: September 19, 2023, 03:17:47 am »
Thanks for your reply.

The SDK example I'm using is based on FreeRTOS+LwIP, so the macro definitions you mentioned are already enabled. Everything is running smoothly without executing 'select' in multiple places. Last night, I enabled the two suspicious macro definitions, LWIP_NETCONN_SEM_PER_THREAD and LWIP_NETCONN_FULLDUPLEX, that appeared in the select code, and everything ran fine throughout the night. I will continue to monitor it.
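For reference, enabling the two options named above is normally done in the project's lwipopts.h. The fragment below is only a sketch: it shows the two defines, not the rest of a working configuration. As far as I know, LWIP_NETCONN_SEM_PER_THREAD also needs the port's sys_arch layer to supply the per-thread semaphore, and the lwIP 2.1.x option documentation describes LWIP_NETCONN_FULLDUPLEX as still experimental, so check both against the SDK version in use.

```c
/* lwipopts.h fragment (sketch): the two options discussed in this thread.
 * Assumption: the port's sys_arch implements the per-thread semaphore
 * hooks that LWIP_NETCONN_SEM_PER_THREAD relies on. */

/* Use one semaphore per thread for blocking socket/netconn calls
 * (including select) instead of creating one per call. */
#define LWIP_NETCONN_SEM_PER_THREAD 1

/* Allow one socket/netconn to be used from several threads at once,
 * for example reading in one task and writing in another. */
#define LWIP_NETCONN_FULLDUPLEX     1
```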

In fact, I noticed that the raw API functions in my program are ultimately called through socket.h, which this webpage lists as thread-safe. So, it seems that to make multithreading work, I might need to enable some other macro definitions.
« Last Edit: September 19, 2023, 03:20:21 am by tilblackout »
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 3698
  • Country: gb
  • Doing electronics since the 1960s...
Re: LWIP may get stuck in the select function
« Reply #4 on: September 19, 2023, 12:28:09 pm »
I can't help you directly but a couple of years ago I looked at implementing WIFI and ETH in my FreeRTOS+LWIP product and while I never did the WIFI part, some people said there are various problems with dual interfaces in LWIP.

Maybe they were doing similar stuff to you, but nobody posted solutions (which is normal; most people don't post what they finally did).

LWIP is complicated, with so many options with subtle effects. Take a look at some of my own LWIP threads. Also development of LWIP (a ~16 year old project) came to a standstill some years ago so there is little support from anywhere; people have moved on. It is a very solid product so I am happy to use it as it is.

You may want to implement a mutex for this, but I could not suggest at what level.

One guy who knows a lot about LWIP is dare.
Z80 Z180 Z280 Z8 S8 8031 8051 H8/300 H8/500 80x86 90S1200 32F417
 
The following users thanked this post: tilblackout

Offline tilblackout (Topic starter)

  • Contributor
  • Posts: 31
  • Country: cn
Re: LWIP may get stuck in the select function
« Reply #5 on: September 21, 2023, 08:00:45 am »
After making select non-reentrant with a mutex, I found today that while the select function is calling FreeRTOS vTaskDelay to wait for its timeout, the program may get stuck in the socket() function of the other network interface in another task. It seems that any other LwIP operation can affect the select function, and that LwIP's support for dual network interfaces is not very good.

Now I have removed the timeout from select and simply call select at regular intervals. This has been running overnight, and there have been no issues on any of the three devices.
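A minimal sketch of that workaround, assuming it means calling select with a zero timeout so that it never blocks inside the stack, and keeping the mutex that serialises the select calls. It uses POSIX calls and a pthread mutex so that it can be compiled on a desktop system; on the target the lock would be a FreeRTOS mutex and the calls would come from lwip/sockets.h. The names select_lock and poll_readable are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical lock that serialises every select() call in the program. */
static pthread_mutex_t select_lock = PTHREAD_MUTEX_INITIALIZER;

/* Poll fd for readability without blocking inside select().
 * Returns 1 if readable, 0 if not, -1 on error. */
static int poll_readable(int fd)
{
    fd_set rfds;
    struct timeval tv = {0, 0}; /* zero timeout: select() returns at once */

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    pthread_mutex_lock(&select_lock);
    int n = select(fd + 1, &rfds, NULL, NULL, &tv);
    pthread_mutex_unlock(&select_lock);

    if (n < 0) {
        return -1;
    }
    return (n > 0 && FD_ISSET(fd, &rfds)) ? 1 : 0;
}
```

In a FreeRTOS task this would sit in a loop that calls poll_readable() and then sleeps with vTaskDelay() for the chosen polling interval, so the waiting happens in the application task rather than inside select. The cost is added latency of up to one polling interval.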
 

Offline peter-h

  • Super Contributor
  • ***
  • Posts: 3698
  • Country: gb
  • Doing electronics since the 1960s...
Re: LWIP may get stuck in the select function
« Reply #6 on: September 21, 2023, 09:45:42 am »
Quote
LWIP's support for dual network interface is not very good.

That's the consensus around the place.

If you solve it one day, please post the method used :)
 

Offline tellurium

  • Regular Contributor
  • *
  • Posts: 229
  • Country: ua
Re: LWIP may get stuck in the select function
« Reply #7 on: September 30, 2023, 10:37:46 am »
tilblackout - out of curiosity, what is your hardware?
Open source embedded network library https://mongoose.ws
TCP/IP stack + TLS1.3 + HTTP/WebSocket/MQTT in a single file
 

Offline tilblackout (Topic starter)

  • Contributor
  • Posts: 31
  • Country: cn
Re: LWIP may get stuck in the select function
« Reply #8 on: October 19, 2023, 02:25:27 am »
Quote
tilblackout - out of curiosity, what is your hardware?

Sorry for the late reply. The CPU I use is the i.MX RT1176 from NXP.
 

Offline tellurium

  • Regular Contributor
  • *
  • Posts: 229
  • Country: ua
Re: LWIP may get stuck in the select function
« Reply #9 on: October 19, 2023, 07:59:09 am »
Thank you.

What is that second PPP interface, is it a cellular modem? Do you do some sort of a gateway?

The reason I am asking is because my company develops a TCP/IP stack (e.g. we'll have a webinar next week with NXP about building WebUI on RT1020),  so I am curious about the use cases.
 

Offline tilblackout (Topic starter)

  • Contributor
  • Posts: 31
  • Country: cn
Re: LWIP may get stuck in the select function
« Reply #10 on: October 20, 2023, 07:03:17 am »
Quote
Thank you.

What is that second PPP interface, is it a cellular modem? Do you do some sort of a gateway?

The reason I am asking is because my company develops a TCP/IP stack (e.g. we'll have a webinar next week with NXP about building WebUI on RT1020),  so I am curious about the use cases.

I work on navigation-related products, and during navigation we need to obtain data from NTRIP servers. This data is used to correct the satellite signals received by GPS or GNSS receivers and so improve position accuracy.

While vehicles are in motion, it's typically not possible to reach an external network through Ethernet cables or Wi-Fi, so PPP dial-up through an IoT card is used instead. It retrieves differential correction data from the NTRIP server to correct errors in the positioning chip. The Ethernet port serves as one of the data storage options, and we also support other storage methods such as RS-422, RS-232 and eMMC.

However, in practice, most of the customers I've encountered use a switch to connect via Ethernet cables to store this data. This requires us to transmit real-time location data through Ethernet (typically with the device acting as a TCP Server) to their computers. After testing, customers need to analyze this data.
 
The following users thanked this post: tellurium

