Author Topic: Can one find out how much data can be sent to a NON blocking socket without loss (Read 2828 times)

peter-h · « **on:** December 21, 2022, 07:45:51 pm »

This is LWIP.

I have code which reads a serial port buffer, finds out how many bytes are in it, and transmits that to a socket.

Unfortunately the socket is non-blocking so if the data is arriving too fast, some gets lost.

The obvious solution (make the socket blocking) would cause problems elsewhere.

Is there some way to get how many bytes a write socket can accept?

For reading from a socket, there appears to be a hack where you give it a zero block length, but I can't find anything for writing to a socket.

Code: [Select]

          if ((rx_len > 0) )
          {
            //read rx_len bytes from the buffer and send over ethernet
            rx_len = serial_receive(i, buf, rx_len);
            if (rx_len > 0)
            {
              if (write(ethser_port[i].client_fd, buf, rx_len) <0)
              {
                //something went wrong. Close the client socket

                dbg_printf("EthSer Closing client idx: %d fd: %d", i, ethser_port[i].client_fd);

                close(ethser_port[i].client_fd);
                ethser_port[i].client_fd = -1;
                continue;  //break from the for-loop early
              }

              ethser_port[i].rx_count += rx_len;

            }
          }

where write() is

Googling suggests there may be a way using something like this

Code: [Select]

//socket ready for writing
        if(FD_ISSET(new_sd, &write_flags)) {
            //printf("\nSocket ready for write");
            FD_CLR(new_sd, &write_flags);
            send(new_sd, out, 255, 0);
            memset(&out, 0, 255);
        }   //end if

but that isn't going to tell me the socket's available tx buffer space.

Nominal Animal · « **Reply #1 on:** December 21, 2022, 10:25:40 pm »

Who told you lwip_write() will never return a short count? They lied.

The return value will tell you how much of the data was buffered (if there was no error). You can discard that amount, but not all you tried to send.

peter-h · « **Reply #2 on:** December 21, 2022, 10:54:21 pm »

That function ends up in

Code: [Select]

int
lwip_write(int s, const void *data, size_t size)
{
  return lwip_send(s, data, size, 0);
}

Surprisingly (this function is all over the place) I can't find a definition what this returns but from what you say it sounds like it returns #bytes actually written and the rest it discarded. Is that right?

I may be able to work with that, with a bit of a hack, because the "discarded" data has already been read out of the UART rx buffer so I can't just dump it; I have to call lwip_write again and repeat until it has all gone.

But a much better way would be some kind of a queue_space() function for lwip_send(). I would call it before reading the UART rx buffer and read out of that buffer the lesser figure.

This code is not mine and I am trying to make it work with correct flow control. TCP does flow control already but in this case the code is loading data into a nonblocking socket without first checking how much space there is.

OTOH, a breakpoint in the code checking for a return value < 0 is taken so lwip_send is returning a negative value.

Nominal Animal · « **Reply #3 on:** December 21, 2022, 11:37:24 pm »

lwip_send() is implemented in src/api/sockets.c:lwip_send().
If this is a TCP socket, then netconn_write_partly() will be called.
If this is a non-TCP socket, then lwip_sendto() (defined later in the same file) will handle the write, and if it cannot be sent as a single packet, an ERR_MEM error will be returned.

(This means that if you use a non-TCP socket, and your write() returns ERR_MEM, you should try halving the buffer size until it succeeds or you get a buffer size that is not worth halving, that is better sent in a single packet. Say 64 bytes or so.)

src/api/api_lib.c:netconn_write_partly() calls netconn_write_vectors_partly() (defined later in the same file), and will modify the number of bytes written, reflecting the number of bytes consumed from the start of the buffer, and either an error, or this amount, will eventually be returned by the original call.

Quote from: peter-h on December 21, 2022, 10:54:21 pm

But a much better way would be some kind of a queue_space() function for lwip_send(). I would call it before reading the UART rx buffer and read out of that buffer the lesser figure.

Unfortunately, the internal structure of LWIP is such that the amount of free space available in the transmit buffer is not visible at this level. The netconn_write_vectors_partly() only finds out how much data was successfully sent by examining the structure modified by the src/api/api_lib.c:netconn_apimsg() call, which basically calls the "lower part" of the API function. It is this lower part that determines how much data can be buffered (for a TCP socket here in this case; the non-TCP socket case allocates a temporary buffer dynamically).

I don't see many sensible options here, besides the obvious "use a dedicated UART rx buffer, and only read out as much as there is free room in that buffer".

ejeffrey · « **Reply #4 on:** December 22, 2022, 12:43:32 am »

Quote from: peter-h on December 21, 2022, 10:54:21 pm

Surprisingly (this function is all over the place) I can't find a definition what this returns but from what you say it sounds like it returns #bytes actually written and the rest it discarded. Is that right?

It's not "discarded". It's still in your buffer, it just wasn't transfered to the network buffers. You might be discarding it by ignoring the return count and deallocating or reusing the buffer but that is you not the send function.

Quote

I may be able to work with that, with a bit of a hack, because the "discarded" data has already been read out of the UART rx buffer so I can't just dump it; I have to call lwip_write again and repeat until it has all gone.

But a much better way would be some kind of a queue_space() function for lwip_send(). I would call it before reading the UART rx buffer and read out of that buffer the lesser figure.

No, your way is much worse. It is inherently not reentrant or thread safe. It depends on the precise internal architecture (and I don't have a ton of experience with lwip, so this will be more general) but it is usually not possible to guarantee that if your proposed queue_space function returns N, that a subsequent call to send will actually be able to write N bytes. What if a network interrupt happens and fills up the buffers with received data before you get around to actually sending the data?

You would have to check the return code either way, so checking the buffer capacity before sending is at best an optimization and likely prone to misuse.

The only way to make your proposed API work is to have the queue_space actually reserve buffers for a future send call. Then an intervening receive interrupt will just have to discard it's data since the buffers are already allocated. But then if you never end up sending the data, that buffer just sits there reserved doing nothing.

The way you need to handle this is that you read a block of data from the serial port, the try resending it repeatedly, checking the return value each time and incrementing the start offset. That's basically what the blocking call does, but the non blocking call gives you the opportunity to give up at any point.

peter-h · « **Reply #5 on:** December 22, 2022, 09:52:11 am »

Working on the last sentence above, does this make sense?

Code: [Select]


          if ((rx_len > 0) && (force_rx == true))
          {
            //read rx_len bytes from the buffer and send over ethernet
            rx_len = serial_receive(i, buf, rx_len);
            if (rx_len > 0)
            {
              int tmp_len;
              bool fail = false;
              // This returns # bytes actually written (if a positive value)
              tmp_len = write(ethser_port[i].client_fd, buf, rx_len);
              if (tmp_len<0) fail=true;

              // Process the case where not all rx_len bytes were written (by the NON blocking socket)
              // This can cause a wait here if the other end is slow in consuming the data.
              if ( (tmp_len<rx_len) && !fail )
              {
            	  int offset = 0;
             	  do
            	  {
            		  rx_len -= tmp_len;
            		  offset += tmp_len;
            		  tmp_len = write(ethser_port[i].client_fd, &buf[offset], rx_len);
            		  if (tmp_len<0) fail=true;
            	  }
            	  while ( ( tmp_len>0 ) && !fail );
              }

              if ( fail )
              {
                 //something went wrong. Close the client socket
                 dbg_printf("EthSer Closing client idx: %d fd: %d", i, ethser_port[i].client_fd);
                 close(ethser_port[i].client_fd);
                 ethser_port[i].client_fd = -1;
                 continue;  //break from the for-loop early
              }

             }
          }

This should work well enough for the job, despite the possible hold-up if the destination has a bottleneck.

The max buffer size written is 512 bytes, and that number is actually written most/all of the time.

Unfortunately it has not fixed the problem. The very first call to write() returns -1, which is strange. This happens after ~35000 bytes have passed through, and the extra code (where not a complete buffer got written) is never executed during that time.

Nominal Animal · « **Reply #6 on:** December 22, 2022, 06:23:10 pm »

Quote from: peter-h on December 22, 2022, 09:52:11 am

The very first call to write() returns -1, which is strange.

Is it a TCP or an UDP socket? Which version of LWIP? (These are crucial to fully investigate the possible call chain.)

peter-h · « **Reply #7 on:** December 22, 2022, 08:02:03 pm »

TCP socket. LWIP V2.0.3.

My code above is wrong. I am just re-doing it. But it is still true that the "short write" is never occurring, which is surprising because the "UART" I am feeding this socket from is a USB VCP which runs pretty fast.

Nominal Animal · « **Reply #8 on:** December 22, 2022, 08:37:36 pm »

Thanks. Have you checked errno to see the exact error code? (If you use the default, LWIP_SOCKET_SET_ERRNO==1, LWIP declares int errno; for you, and sets it whenever the socket calls fail.)

It is extremely likely the call fails in src/api/api_lib.c:netconn_write_partly() called by src/api/sockets.c:lwip_send().

If errno == ERR_VAL, there is previously written data in the TCP transmit buffer that hasn't been sent yet, and the send timeout hasn't fired yet. Because the send timeout might fire in the mean time, the way LWIP uses to track the number of bytes sent could fail (return the sum of both writes, or zero), so netconn_write_partly() gives up in this case. In English, "There is pending data in the TCP transmit buffer so I cannot send more data right now, please try again a bit later." (And the correct POSIX error code would be EWOULDBLOCK or EAGAIN.)

Otherwise, the error is returned by the lower half call, netconn_apimsg(lwip_netconn_do_write,msg), but I bet the failure is the ERR_VAL case above.

The more I read LWIP sources and understand its internal architecture, the less I like it. Then again, I haven't designed an embedded TCP/IP stack myself, so who am I to judge?

peter-h · « **Reply #9 on:** December 22, 2022, 09:16:31 pm »

This is my latest code. Just fixed some silly mistakes in the code handling a short write

Code: [Select]

          if ((rx_len > 0) && (force_rx == true))
          {
            //read rx_len bytes from the serial port buffer
            rx_len = KDE_serial_receive(i, buf, rx_len);

            //  and send them over ethernet
            if (rx_len > 0)
            {
              //dbg_printf("rx_len: %d", rx_len); //Debug to show how (well) the buffering works

              int written;
              bool fail = false;

              // This returns # bytes actually written (if a positive value)
              written = write(ethser_port[i].client_fd, buf, rx_len);
              if (written<0)
              {
            	  fail=true;	// error condition
              }
              else
              {
            	  ethser_port[i].rx_count += written;
              }

              // Process the case where not all rx_len bytes were written (by the NON blocking socket)
              // This can cause a wait here if the other end is slow in consuming the data.
              if ( (written<rx_len) && !fail )
              {
            	  int offset = 0;
             	  do
            	  {
            		  rx_len -= written;
            		  offset += written;
            		  written = write(ethser_port[i].client_fd, &buf[offset], rx_len);
            		  if (written<0)
            		  {
            			  fail=true;	// error condition
            		  }
            		  else
            		  {
            			  ethser_port[i].rx_count += rx_len;
            		  }
            	  }
            	  while ( ( written<rx_len ) && !fail );
              }
/*
 * This code crashes the task somehow
              if ( fail )
              {
                 //something went wrong. Close the client socket
                 dbg_printf("EthSer Closing client idx: %d fd: %d", i, ethser_port[i].client_fd);
                 close(ethser_port[i].client_fd);
                 ethser_port[i].client_fd = -1;
                 continue;  //break from the for-loop early
              }
*/
            }
          }

I can't find errnoto.
LWIP_SOCKET_SET_ERRNO is defined.

The only error code I ever see is

/** Out of memory error. */
ERR_MEM = -1,

and I am currently ignoring it. I just retry.

which is interesting; this stuff is defined in lwipopts.h, and I have spent many weeks exploring the mostly undocumented values in there, but if there is a bottleneck downstream, LWIP buffers will always overflow eventually, yet TCP/IP is supposed to do flow control (and it does do it in other code I have written e.g. the HTTP server and the file transfer functions there (a PC->box file transfer, writing to a 30kbyte/sec FLASH FS would not work at all without flow control, obviously).

Quote

The more I read LWIP sources and understand its internal architecture, the less I like it. Then again, I haven't designed an embedded TCP/IP stack myself, so who am I to judge?

The problem is that when trying to develop a box with an RJ45 in it, there will be man-years of code there no matter how you shake it

And if you don't have man-years, and especially if you don't have the expertise, then you have to use a library... and whose? LWIP has been out for 15 years and runs solidly in this product.

Unfortunately, as usual with these libraries which I have not written myself, I am on the limit of my understanding. The only other bit of internet code I did was the HTTP server but that used the netconn API (which lies underneath the sockets API), and an HTTP server is transaction based so is easier. This code I didn't write myself.

Nominal Animal · « **Reply #10 on:** December 22, 2022, 09:50:03 pm »

Quote from: peter-h on December 22, 2022, 09:16:31 pm

I can't find errnoto.

I, uh, missed a space in there. It's 'errno'.

The return value of write() is either positive, or -1 to indicate an error. (It should never return zero.) When it returns -1, examine errno to see what caused it. My bet is that errno == ERR_VAL, per my previous message.

This is completely untested, but here's how I'd write the code, when you have rx_len > 0 in buf:

Code: [Select]

    // Also omitted: check that ethser_port[i].client_fd != -1, before rx_len = KDE_serial_receive(i, buf, rx_len);

    size_t  tx_len = 0;

    while (tx_len < rx_len) {
        ssize_t written;

        errno = 0;
        written = write(ethser_port[i].client_fd, buf + tx_len, rx_len - tx_len);
        if (written > 0) {
            tx_len += n;
            continue;
        } else
        if (errno == ERR_VAL) {
            // yield()? sleep()? For one half of TCP send timeout interval?
            continue;
        } else {
            // Something went wrong.  'errno' should contain the error code.
            shutdown(ethser_port[i].client_fd, SHUT_RDWR);
            close(ethser_port[i].client_fd);
            ethser_port[i].client_fd = -1;
            break;
        }
    }

peter-h · « **Reply #11 on:** December 22, 2022, 10:33:32 pm »

The return value of write() I am seeing is always one of

1 - if I am sending keystrokes
512 - if I am sending big blocks (and then the last block will be < 512 although I have not actually checked this)
-1 - if there is an error, and this corresponds to "too much data too fast"

Your code if of course much neater than mine

I didn't realise that one could do

buf + tx_len

being same as

&buf[tx_len]

I knew buf is same as &buf[0]. Learn something every day

I am now testing with various size blocks and with various bottlenecks at the far end (various output UART baud rates). I am getting interesting results. With 1200 baud output, and USB VCP input of 25kbytes/sec, and that's a real proper output bottleneck, I can send ~7k without data loss.

7k is in the right ballpark for the various buffers in the system. 1k on each UART, 512 bytes to service the sockets, a few k in LWIP. Yet this means that flow control must be working over the LAN. So this is more complicated. But I am happy to document this, as a system intended for half duplex applications (fairly essential anyway since a server cannot initiate a connection, so the master will be at the client end), and packets no bigger than a few k.

Nominal Animal · « **Reply #12 on:** December 23, 2022, 12:24:48 am »

Do the devices connected to the UARTs have hardware flow control, RTS/CTS? This is necessary for not dropping any data in your scenario.

The trick of managing CTS over a long pipe is to initially pass the buffer size, and then report whenever additional bytes have been written.

For example, if we consider data flow A→B, and A knows B has 512 byte buffer, then:
A: Here is 500 bytes.
(A knows B has room for 12 additional bytes.)
B: Received 500 bytes.
B: Wrote 100.
(A knows B has room for 112 additional bytes.)
A: Send 52 bytes.
(A knows B has room for 60 additional bytes.)
B: Received 52 bytes.
B: Wrote 200.
(A knows B has room for 260 additional bytes.)
This way A knows exactly when B can receive more data. If the device connected to A supports hardware flow control, then A keeps CTS asserted whenever it has room in its UART receive buffer.

In a real-world implementation, the problem is that we need a separate metadata channel to report how many bytes have been written. One could use TCP urgent data (say 16-bit network-endian wrote count) –– the "received N bytes" being inherent in the protocol ––, but LWIP does not support urgent data currently.

peter-h · « **Reply #13 on:** December 23, 2022, 09:33:38 am »

Yes; that aspect I am aware of (serial comms has been my day job for 40 years).

However, what I did was a software hack at the RH end, whereby the data is just dumped, so the output UART baud rate is not limiting the flow. In that situation, no data should be lost, because flow control should work along the preceeding sections.

I am finding that write() returns -1 sometimes. This is very rare and happens after ~50 512-byte packets. Predictably, some data is then lost.

There was that code which, upon getting -1, closed the socket and attempted to clean up a bit, and new incoming data would re-open the connection, but that didn't work, so I just took it out.

Curiously, I am never seeing short writes to the socket.

On the other end I see no error condition ever

Code: [Select]

    //read data from TCP connection for transmit
          int tx_len = serial_get_opqspace(i);
          if (tx_len > sizeof(buf)) tx_len = sizeof(buf);

          //tx_len contains the maximum number of characters the UART buffer can hold. IOW
          //tx_len has the maximum number of characters we can read from the socket.
          if (tx_len > 0)
            {
            tx_len = read(ethser_port[i].client_fd, buf, tx_len);

            //Now tx_len is either >0 when data has arrived, 0 when the remote side
            //closed the socket or -1 to indicate no data or an error.
            if (tx_len > 0)
              {
              //we have data, send it to the UART
              serial_transmit(i, buf, tx_len);
              ethser_port[i].tx_count += tx_len;
              }

            //Check socket closed or an error occured (except for when the socket
            //return no data in which case errno is set to EWOULDBLOCK).
            if ((tx_len <= 0) && (errno != EWOULDBLOCK))
              {
              //Something went wrong or remote closed. Close the client socket

              dbg_printf("EthSer Closing client idx: %d fd: %d", i, ethser_port[i].client_fd);

              close(ethser_port[i].client_fd);
              ethser_port[i].client_fd = -1;
              continue;  //break from the for-loop early; no need to continue with this socket
              }
            }

Just had an idea. Maybe some buffer size is marginal. I will have a play with lwipopts.h. EDIT: tried a load, makes no real difference.

I think, enough time spent on this. I am going to document this as a half duplex system, max packet size 1k unless baud rates chosen to throttle data appropriately. In reality most applications are half duplex; this is not a 1980s-style terminal server

Quite weird though. Even more weird that the short writes are never seen. The full 512 bytes are always written, or -1.

Nominal Animal · « **Reply #14 on:** December 23, 2022, 06:34:38 pm »

Quote from: peter-h on December 23, 2022, 09:33:38 am

Yes; that aspect I am aware of (serial comms has been my day job for 40 years).

Don't be offended; I just cannot help but mention things that could be useful for others as well. There have been several threads about serial port forwarding, so I just wanted everyone else possibly reading this thread to understand the possible issues for buffer overruns here. I always write this way, and it absolutely should not be taken as if I believed you didn't know: I write these things so that everyone else is aware of these things as well.

Quote from: peter-h on December 23, 2022, 09:33:38 am

However, what I did was a software hack at the RH end, whereby the data is just dumped, so the output UART baud rate is not limiting the flow. In that situation, no data should be lost, because flow control should work along the preceeding sections.

The multiple sequential buffers and the latencies inherent in flushing a buffer does cause a very interesting phenomena, though.
Even when you have more than enough bandwidth, you can still get congestion due to the latencies, if you have multiple sequential buffers.

There is a good traffic analogy: consider an occasional slightly longer latency, that causes the traffic to stop, just like they stop at street lights.
In theory, the entire queue could move as one when the light switches to green, but that's not usually what happens: the cars accelerate individually, with the next one starting to accelerate only when the one in front has pulled a certain distance away.

This buffering-latency-inchworm-effect (I bet network people have a better name for this!) is excarberated by the number of buffers you have in sequence.
You can have data flowing very well for quite some time, and then suddenly stuff just backs up, and takes a relatively long time to regain the previous throughput.

Like frogs jumping in a queue: if only they jumped in sync, they'd make much faster progress. Their overall speed (data throughput) ends up being determined by their jump phases (buffer fill-flush latencies)!

Quote from: peter-h on December 23, 2022, 09:33:38 am

Curiously, I am never seeing short writes to the socket.

If all downstream buffers are a multiple of 512 bytes in size, and the TCP MTU is large enough to never split 512 byte packets, then this is to be expected: either the buffer is full, or it has room for a multiple of 512 bytes, all down the buffer chain.

peter-h · « **Reply #15 on:** December 23, 2022, 07:19:21 pm »

Quote

Don't be offended

Never offended by anything "technical" - it's all a great learning experience

In an RTOS environment, the way buffers move along is likely to be quite random, or maybe sometimes not...

Quote

If all downstream buffers are a multiple of 512 bytes in size, and the TCP MTU is large enough to never split 512 byte packets, then this is to be expected: either the buffer is full, or it has room for a multiple of 512 bytes, all down the buffer chain.

That's really interesting. Yes, there are two 512 byte buffers, but in addition there are a few k MTU-sized buffers (1500 + a bit). I tried it with the 512 byte buffers much bigger than the MTU and still didn't see a short write. But this could be for any reason at all. TCP/IP is incredibly complex.

Nominal Animal · « **Reply #16 on:** December 23, 2022, 08:25:56 pm »

(Just wanted to be sure. To me, these discussions are like talking shop while standing in a hallway or having a coffee/tea/soda, with lots of interested faces following the conversation, while not actively participating. In real life, I'm the type who observes their nonverbal cues, and when abbreviations/phenomena/algorithms/etc. are discussed that those others seem to fail to follow, I stop and describe/explain things to everyone; and make sure everybody is on the same track. Online, it doesn't work that well –– no cues ––, and sometimes the other persons take my explanations/descriptions as if I thought they might not know that, while it's never that: it's just that I want everybody following the discussion to be kept along. Solving a problem for just one person isn't that interesting or useful, really: it is when you help with the problem solving procedure, possibly introducing new concepts and ways of solving similar problems, alternate causes for such problems, that others might encounter later on and stumble on to the recorded discussion, that makes it worthwhile to spend as much time and effort on the things as I do. I am not very "clever" myself, I just love trying to help solve problems, and do it mostly via brute force effort.

)

Quote from: peter-h on December 23, 2022, 07:19:21 pm

there are two 512 byte buffers, but in addition there are a few k MTU-sized buffers (1500 + a bit). I tried it with the 512 byte buffers much bigger than the MTU and still didn't see a short write. But this could be for any reason at all. TCP/IP is incredibly complex.

Yep, and I'm not at all familiar with how LWIP handles TCP buffering; the upper/lower function dispatch is making it hard to track the exact call chain.

I know a Berkeley/POSIX -type socket interface does require indirect function dispatch, but I prefer function pointers myself. With something like Elixir Cross Referencer or even plain grep -e member -R . they are easier to follow than what LWIP uses. (If you wonder what that does, https://elixir.bootlin.com/ exposes the Linux kernel sources using it. I use it all the time to trace stuff through the Linux kernel. One can install it locally; if you use a httpd server on Linux that only serves on loopback (127.x.y.z), it will not be externally accessible.)

But just because I would do things differently, and dislike the way LWIP does things, does not mean I wouldn't use LWIP myself: it is just a tool, after all, and one with significant resources and real-world testing behind it. I was just grumbling...

peter-h · « **Reply #17 on:** December 23, 2022, 10:34:02 pm »

Actually almost nobody understands how LWIP manages its buffers etc versus the config options in the lwipopts.h file. Much is online but ambiguous unless you already know the answer. I spent days changing various options and seeing how much RAM got used up and where, and documented them as best as I could.

But as far as free code goes, there probably isn't anything better.

An interesting point relating to this is whether LWIP (or how much of it) is zero-copy. Obviously on the way out (to ETH) it can't be because the least it needs to do to your supplied buffer is to attach the headers etc to it, and in reality in needs to split it if > MTU. On the way in it could be zero-copy but then the biggest buffer it would give you would be MTU sized.

Nominal Animal · « **Reply #18 on:** December 24, 2022, 12:44:57 am »

Quote from: peter-h on December 23, 2022, 10:34:02 pm

But as far as free code goes, there probably isn't anything better.

Quite possible.

I have looked at FNET, because vjmuzik has an Arduino library (FNET fork and NativeEthernet) for use with Teensy, which is more to my liking, but I haven't really used it in anger either. Teensy 4.1 is the only MCU I currently have with a 10/100 Ethernet on it. (I do have several different SBCs running Linux, and I can do zero-copy network I/O in Linux using memory-mapped tx and rx buffers, but that's different: the kernel does all the hard, complex stuff there.)

Quote from: peter-h on December 23, 2022, 10:34:02 pm

An interesting point relating to this is whether LWIP (or how much of it) is zero-copy.

True.

I've done lots of MPI stuff, and particularly like the asynchronous I/O interface. It has very similar requirements as zero-copy, in that one must not modify the data buffer until the async operation completes. (Most implementations use a dedicated I/O thread handling the transfers in non-blocking mode.)

Compared to the socket interface, a zero-copy would really need a completely different interface, one where a write/send either takes a callback or closure, or returns a token, so that the buffer is retained until the operation completes (by the IP stack doing the callback, or updating the token state). Similarly, a read/receive should really be event-based, with a similar token or closure to tell the IP stack the data is no longer needed.

If I had to support zero-copy on Berkeley/POSIX sockets -type interface, I'd cry: it really isn't suitable for the task.

At minimum, I'd like to separate the header part and the payload part. Although they would be contiguous in received messages, having them separate in the API, and separate when sending messages –– especially if you could do a scatter-send with multiple recipients, the stack duplicating the data internally as needed –– would make a lot of sense. (The completion tracking would then be per header, not per data.)

Unfortunately, even in MPI, using MPI_Isend() and MPI_Irecv() for "zero-copy"/nonblocking/asynchronous I/O, seem to be extremely hard for many programmers to understand. I've even had heated arguments with "MPI Experts" who claim that using these is inherently dangerous (because they just didn't understand how to use them properly). It isn't, and is the only way to allow a HPC distributed simulator/calculator to both compute and communicate simultaneously, not wasting time.

At the core, they behave just like zero-copy async sends and receives: the call returns immediately, but the data is read from the buffer or written to the buffer at some point in the future, and one needs to check the state of the request object to determine whether it succeeded or not. (I don't like that, I'd much prefer to have a callback/closure/event instead.) So, having dealt with the confusion, I'm not at all surprised that zero-copy I/O interfaces are "hard", when even something as stable and widely used as MPI confuses the "experts".

In IP stacks, the layered OSI model confuses full-stack developers even more, because it takes a lot of experience and a robust personality to understand that such models are abstract, and do not need to –– should not –– reflect the actual API or implementation. (Saying that out aloud among network software developers would normally start a shouting match, too.. I got some really unpopular views! But I do try to explain what my views and opinions are based on, so that one can check if they have a reason to agree or disagree.)

peter-h · « **Reply #19 on:** December 24, 2022, 01:46:19 pm »

This thing is more complicated because the interface between LWIP and the CPU's ETH controller is just moving data packets, to which the TCP (UDP?) headers and tails are added by the ETH controller.

The interface software could just pass a pointer to a list of packet pointers, but my version does actually copy over the data. There is a later zero-copy version which I haven't implemented because I don't need it, there is zero support, and debugging this stuff is almost impossible.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Can one find out how much data can be sent to a NON blocking socket without loss (Read 2828 times)

Share me