Topic: LWMesh Config & Optimizations for 100+ Nodes

Nick Novak · « **on:** November 04, 2016, 06:01:52 pm »

Good Day all.

I’m looking for advice regarding configuration and potential optimizations to LWM to support up to 100 sensor nodes.

First of all this is a great piece of software. I’m not a fan of bloat and other than this do exclusively bare metal applications but for this application and for the moment have embraced ASF. I’m sad to hear that it’s been orphaned.

My application is a linear deployment of vibration sensors in an outdoor environment. Sensors are ~6m apart but are pretty poor RF performers, so I expect maybe 15 hops in a worst case scenario. The sensors report back to the coordinator (with ACK request) with 1 packet every 10 seconds, and asynchronously when alarms and status changes occur. Those are generally very infrequent, but several sensors (I’m currently testing 20, likely fewer in real life) may conceivably try to report an alarm simultaneously.

All sensor nodes will be constantly powered and I’d like all nodes to be capable of routing so the installers won’t have to be aware of what’s happening under the hood.

Here is the configuration I’m currently using:

#define NWK_BUFFERS_AMOUNT 20
#define NWK_DUPLICATE_REJECTION_TABLE_SIZE 1024
#define NWK_DUPLICATE_REJECTION_TTL 1000
#define NWK_ROUTE_TABLE_SIZE 255
#define NWK_ROUTE_DEFAULT_SCORE 5
#define NWK_ACK_WAIT_TIME 500
#define NWK_GROUPS_AMOUNT 3
#define NWK_ROUTE_DISCOVERY_TABLE_SIZE 15
#define NWK_ROUTE_DISCOVERY_TIMEOUT 1000

#define NWK_ENABLE_ROUTING
#define NWK_ENABLE_ROUTE_DISCOVERY
#define NWK_ENABLE_SECURITY
#define NWK_ENABLE_SECURE_COMMANDS
#define SYS_SECURITY_MODE 0
#define NWK_ENABLE_MULTICAST

I’ve made a few tweaks to the duplicate rejection logic to increase its size like that.

Also coordinator and sensor nodes are nearly identical devices and run the same firmware, coordinator functions are switched on at run time when the hardware is initialized.

Anyway, there are two issues I’m struggling with:
The first is startup and network commissioning (all sensors are likely on a common power bus). I think I can solve this with sufficient randomization to stagger route discovery, but I’m open to suggestions if anyone has a more organized way of ensuring an orderly startup.

The second is my multiple simultaneous alarm scenario. With perhaps 20 sensor nodes the network generally seems to fall flat on its face and goes into a tail spin for sometimes 10s of seconds and I’m struggling to understand why this is happening. A further bit of information – the sensor nodes are not yet installed, the PCBs are still on a panel so shouldn’t need to be doing any routing but are all in range and close proximity of each other.

So my first thought was duplicate sequence number table was filling – I don’t believe so now but haven’t added any high water mark trace yet.

I’ve tried AODV and natural routing – both seem pretty bumpy at startup and exhibit no difference in this scenario.

I suspected the proximity of the radios and -57dB LO leakage may be playing havoc with the radio's (sorry, SAMR21) CCA, so I switched it to carrier sense rather than energy detect.

Doing a little reading on the radio data sheet I decided to increase the CSMA back off exponents. I think this has provided a marginal improvement but still pretty far from acceptable.

Future options I’m considering:

Push back off times out as far as possible.

Prioritize routing and control endpoint traffic + implement my own CCA with full packet round trip back off time.

Possibly increase my bit rate to lower packet on air time.

Implement some kind of polled or time slotted scheme to organize traffic.

If anyone’s read this far I’m hoping you may have some useful suggestions or insights.

Nick.

ataradov · « **Reply #1 on:** November 05, 2016, 07:55:39 am »

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

1 packet every 10 seconds

That's stretching it a bit. Imagine the nodes close to the coordinator: they will have to generate their own messages plus pass along everyone else's.

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

#define NWK_BUFFERS_AMOUNT 20

Probably excessive. I'd start with like 5-10.

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

#define NWK_DUPLICATE_REJECTION_TABLE_SIZE 1024

No need to be bigger than maximum number of devices in the network. And even that is overdoing it.

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

#define NWK_ENABLE_MULTICAST

No real need for this.

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

I’ve made a few tweaks to the duplicate rejection logic to increase its size like that.

Why? There is no way it needs to be more than 100 entries ever.

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

I think I can solve this with sufficient randomization to stagger route discovery but I’m open to suggestions

That's a good way of doing it, it works.
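
A minimal sketch of the staggering idea, assuming each node derives its start delay from its 16-bit network address plus whatever extra entropy is available. The function name, the hash constant, and the 5 second window are illustrative assumptions and not part of LwMesh.

```c
#include <stdint.h>

/* Hypothetical helper: spread power-on start times over a window so that
 * nodes sharing a power bus do not all begin route discovery at once. */
#define STARTUP_WINDOW_MS 5000u  /* assumed window; tune for the deployment */

static uint32_t startup_delay_ms(uint16_t node_addr, uint32_t entropy)
{
    /* Mix the address with any extra entropy (e.g. an ADC noise reading)
     * using a simple multiplicative hash. This is not cryptographic. */
    uint32_t x = ((uint32_t)node_addr << 16) ^ entropy;
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x % STARTUP_WINDOW_MS;
}
```

A node would wait this long after reset before sending its first frame. Because the delay is deterministic for a given address and entropy value, two nodes only collide if both inputs hash to nearby delays.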

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

I’m struggling to understand why this is happening.

Sniffer logs would help, but even if they are not doing any routing, they still rebroadcast all the requests and that will jam everything.

Quote from: Nick Novak on November 04, 2016, 06:01:52 pm

Implement some kind of polled or time slotted scheme to organize traffic.

This is your best bet, if it is possible in your application.

Nick Novak · « **Reply #2 on:** November 07, 2016, 10:09:30 pm »

Hi Alex,
Thanks for getting back to me. I’ve googled a few LWM related questions and have seen your name come up pretty often. You have my praise for supporting this thing solo and beyond your day job.

Thanks for clarifying that duplicate rejection thing, I was under the impression that entries stayed in there until they expire and weren’t replaced on an address / sequence # basis. I’ll scale that back and save some memory.

I’m planning to use multicast for some global coordinator to sensor node control, that’s why it’s in there but I’m not using it yet, only tested in a cut up copy of the WSN demo.

Speaking of sniffer logs, do you know of something I can buy? I see some old threads talking about sniffers and protocol dissectors but haven’t found anything tangible.

A further bit of info for the benefit of anyone reading: changing the bitrate is probably not advisable. I did no investigation into the why but can confirm it doesn’t just work. PER was through the roof.

When you say: “but even if they are not doing any routing, they still rebroadcast all the requests and that will jam everything.” you mean when doing route discovery only correct? Almost all sensor nodes are going direct to the coordinator in my desktop test (a couple usually decide to do one hop) so I expect almost no rebroadcasting once route discovery is taken care of. Am I missing something?

Here is a bit of an update / brain dump: (forgive me if this looks like my unfinished homework, I hope to make a little better progress tomorrow)

Since Friday I’ve added some diagnostic messages so my sensor nodes will report some error counters. What I’m seeing is very few PHY_CHANNEL_ACCESS errors which dispels my idea that there is overwhelming congestion of the airwaves or problems with CCA misfiring due to proximity to other receivers. There is a pretty significant number of PHY_NO_ACK errors and many retransmits from my application. Could multiple application messages being received by the coordinator before ACKs are issued cause problems? I don’t think so, they should get queued up each in their own frame buffer right? ACK collisions should be as rare as any packet collisions with the radio’s built in CCA and TX_ARET features. So I am a little puzzled, I need to better understand the ACK mechanisms at the PHY level and the application layer I think. As I understand it each hop is ACKed by the PHY auto ack and the application layer ACK is a fully formed packet that gets routed back, probably on endpoint 0. Does that sound about right?

Or maybe my CPU is just spending all its time indexing through the duplicate rejection table (kidding).

Nick.

ataradov · « **Reply #3 on:** November 07, 2016, 10:21:46 pm »

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

Thanks for clarifying that duplicate rejection thing, I was under the impression that entries stayed in there until they expire and weren’t replaced on an address / sequence # basis. I’ll scale that back and save some memory.

They do stay until they expire. And if your expiration time is 1 second, then a 100-entry table will give you a rate of 100 frames per second. This is about how much a device can realistically receive under ideal conditions.
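
The arithmetic in this reply can be written down as a small helper. This is only a sketch of the sizing rule stated above (entries are freed on expiry, so the table size divided by the TTL bounds the frame rate); the function name is made up for illustration.

```c
#include <stdint.h>

/* If each tracked frame holds a table entry until its TTL runs out, the
 * sustainable rate is roughly table_size / ttl. */
static uint32_t dr_max_frames_per_second(uint32_t table_size, uint32_t ttl_ms)
{
    if (ttl_ms == 0u) {
        return 0u;  /* guard against division by zero */
    }
    return (table_size * 1000u) / ttl_ms;
}
```

With the values discussed here, `dr_max_frames_per_second(100, 1000)` gives 100, while the original 1024-entry table with the same 1000 ms TTL would allow 1024, far more than a single receiver can take in.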

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

I’m planning to use multicast for some global coordinator to sensor node control, that’s why it’s in there but I’m not using it yet, only tested in a cut up copy of the WSN demo.

I would strongly advise against this. Multicast has no real advantages over plain broadcast, but can confuse things a lot.

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

Speaking of sniffer logs, do you know of something I can buy? I see some old threads talking about sniffers and protocol dissectors but haven’t found anything tangible.

Atmel Sniffer lists a few devices that are reasonably cheap. There are also expensive options, like Ubiqua (https://www.ubilogix.com/products/ubiqua).

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

A further bit of info for the benefit of anyone reading: changing the bitrate is probably not advisable. I did no investigation into the why but can confirm it doesn’t just work. PER was through the roof.

With increased bitrate you lose LQI, and the stack relies on that for route discovery. You can use some other metric, of course.

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

I expect almost no rebroadcasting once route discovery is taken care of.

As long as routes are discovered and stable. And when you stress the system beyond normal conditions, you will see increased route loss and discovery, which adds to the problem. All of this has to be tuned for large scale networks.

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

There is a pretty significant number of PHY_NO_ACK errors and many retransmits from my application.

That's nodes trying to relay all those broadcast messages and not listening for incoming messages.

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

Could multiple application messages being received by the coordinator before ACKs are issued cause problems?

Not really.

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

ACK collisions should be as rare as any packet collisions with the radio’s built in CCA and TX_ARET features.

MAC ACKs are sent without CSMA/CA and retries.

Quote from: Nick Novak on November 07, 2016, 10:09:30 pm

As I understand it each hop is ACKed by the PHY auto ack and the application layer ACK is a fully formed packet that gets routed back, probably on endpoint 0. Does that sound about right?

That's exactly right.

Nick Novak · « **Reply #4 on:** November 10, 2016, 10:58:07 pm »

Ok protocol dissector is up and running. Very exciting, seldom do I have such a nice diagnostic tool, I’m already getting distracted with reading about adding custom protocol dissectors.

So since last post I’ve taken a step back and disabled routing and route discovery in my desktop scenario and things are performing better but not flawlessly like I’d expect. Wireshark has provided some insight though.

Here is an excerpt of a capture:

13079   2235.939376   0x4dee   0x0000   LwMesh   58   Lightweight Mesh, Nwk_Dst: 0x0000, Nwk_Src: 0x4dee, MIC SUCCESS
13080   2235.940376         IEEE 802.15.4   5   Ack
That one was a good packet sent without ACK request and corresponding MAC ACK frame

13081   2235.947377   0x4dee   0x0000   LwMesh   54   Lightweight Mesh, Nwk_Dst: 0x0000, Nwk_Src: 0x4dee, MIC SUCCESS
13082   2235.951377   0x4dee   0x0000   LwMesh   54   Lightweight Mesh, Nwk_Dst: 0x0000, Nwk_Src: 0x4dee, MIC SUCCESS
13083   2235.954377   0x0000   0x4dee   LwMesh   25   Lightweight Mesh, Nwk_Dst: 0x4dee, Nwk_Src: 0x0000, MIC SUCCESS
13084   2235.963378   0x4dee   0x0000   LwMesh   54   Lightweight Mesh, Nwk_Dst: 0x0000, Nwk_Src: 0x4dee, MIC SUCCESS
13085   2235.966378   0x0000   0x4dee   LwMesh   25   Lightweight Mesh, Nwk_Dst: 0x4dee, Nwk_Src: 0x0000, MIC SUCCESS
13086   2235.980380   0x0000   0x4dee   LwMesh   25   Lightweight Mesh, Nwk_Dst: 0x4dee, Nwk_Src: 0x0000, MIC SUCCESS
13087   2235.985380   0x4dee   0x0000   LwMesh   54   Lightweight Mesh, Nwk_Dst: 0x0000, Nwk_Src: 0x4dee, MIC SUCCESS
13088   2236.155397   0xb5e8   0x0000   LwMesh   58   Lightweight Mesh, Nwk_Dst: 0x0000, Nwk_Src: 0xb5e8, MIC SUCCESS
13089   2236.156398         IEEE 802.15.4   5   Ack
13090   2236.169399         IEEE 802.15.4   5   Ack

This illustrates my problem. I’m not sure if the MAC ACK frames are being clobbered by another transmission or what. After the first 2 tries the coordinator actually replies with the application layer ACK packet, but I think it isn’t processed because the sender believes his packet wasn’t even sent. So packets are being received and decoded, but where are the MAC ACKs?
Finally 0x4DEE gives up and 0xB5E8 gets a chance, his packet is ACKed right away, then there is a second ACK frame in there and its sequence number doesn’t seem to match any recent packet transmission. Could my little ZIGBIT be missing packets?

My first thought is to move out the CSMA backoff period minimum but it will have to wait as I'm already late for dinner. My wife will not be pleased with my progress.

Any thoughts Alex, while I’m considering my next experiment?

Nick

ataradov · « **Reply #5 on:** November 10, 2016, 11:15:03 pm »

Quote from: Nick Novak on November 10, 2016, 10:58:07 pm

Here is an excerpt of a capture:

Can you actually attach Wireshark capture files? Abbreviated decoding does not show much.

Quote from: Nick Novak on November 10, 2016, 10:58:07 pm

That one was a good packet sent without ACK request and corresponding MAC ACK frame

This looks like the receiver could not receive the PHY ACK. This may be an indication of some hardware issue, or maybe not. I'd have to look at the detailed logs.

Although this does look like nodes are just talking over each other. This happens from time to time, and it is normal if this is rare.

Quote from: Nick Novak on November 10, 2016, 10:58:07 pm

My first thought is to move out the CSMA backoff

Leave all low level settings in their default state. Whoever selected them did a good job, and they work for the majority of applications.

Nick Novak · « **Reply #6 on:** November 11, 2016, 01:14:32 am »

Quote from: ataradov on November 10, 2016, 11:15:03 pm

Can you actually attach Wireshark capture files? Abbreviated decoding does not show much.

Of course but not until the morning. Was in a hurry and left it all at my desk.

Quote from: ataradov on November 10, 2016, 11:15:03 pm

Leave all low level settings in their default state. Whoever selected them did a good job, and they work for the majority of applications.

LOL. I'll try to remember that.

Off topic side question: do you know much about the ranging toolbox and why it has all but disappeared? Was the phase measurement actually accurate or at least stable in a multipath environment?

Nick Novak · « **Reply #7 on:** November 11, 2016, 01:55:44 am »

Quote from: ataradov on November 10, 2016, 11:15:03 pm

This looks like the receiver could not receive the PHY ACK. This may be an indication of some hardware issue, or maybe not. I'd have to look at the detailed logs.

Anything is possible but I have a pretty minimalistic RF section: the balun and ceramic antenna used on the SAMR-xplained board and a small tuning network. Not much to go wrong. I haven't looked at output power or at the antenna impedance (it's all too small for me to solder) so I'm making the assumption that it's working well enough. My sensor board is only 14mm wide, so there is not much counterpoise for the antenna, which I think explains the mediocre RF performance compared to my first boards, which had a full dipole.

I'll try to find time to do an output power comparison between my boards and the SAMR-xplained board.

ataradov · « **Reply #8 on:** November 11, 2016, 02:45:07 am »

Quote from: Nick Novak on November 11, 2016, 01:14:32 am

Off topic side question: do you know much about the ranging toolbox and why it has all but disappeared? Was the phase measurement actually accurate or at least stable in a multipath environment?

It was never really available in the first place. PMU itself is susceptible to all typical RF problems, no miracles there. But the actual distance measurement comes from results obtained over ~80 frequencies, so it eliminates most of the problems (after some math).

But I don't think there will be any new releases of the RTB, so I'd forget that thing exists.

Nick Novak · « **Reply #9 on:** November 11, 2016, 03:37:02 pm »

Hey Alex,
Here are a couple wireshark captures. The one with 21 sensors is a mess. The one with a single sensor is a bit more orderly but has some discrepancies like MAC ACK frames with no LWM frame, the sequence number in the MAC ACKs is increasing by 1 suggesting I just didn't see the LWM frame.

I can't see the ZIGBIT / wireshark setup missing packets at this low a rate so I'm puzzled by this. I've been getting pretty strong RSSI values between my boards given that they're all on my desk and less than a meter apart.

Perhaps I'll fire up that FCC approval TX test application as a sanity check. I also have a proof of concept application that was based on a cut up version of the WSN demo which streamed raw accelerometer data over the wireless. I'll dust that off as well and see what I get in wireshark.

I should add that in the single sensor capture I've removed all modifications I made to the radio registers, everything is running stock. The 21 sensor file is still running with larger CSMA back off exponents & 5 CCA retries. (It's a drag to reprogram them all)

Nick

ataradov · « **Reply #10 on:** November 11, 2016, 04:38:06 pm »

Quote from: Nick Novak on November 11, 2016, 01:55:44 am

Not much to go wrong.

Well, without tuning, I'd say there is 100% chance that performance is sub-optimal. I've never seen an RF board that works out of the box without tuning and matching.

ataradov · « **Reply #11 on:** November 11, 2016, 04:50:52 pm »

Quote from: Nick Novak on November 11, 2016, 03:37:02 pm

I can't see the ZIGBIT / wireshark setup missing packets at this low a rate so I'm puzzled by this. I've been getting pretty strong RSSI values between my boards given that they're all on my desk and less than a meter apart.

It is possible that the signal is damaged beyond the sniffer's ability to receive frames. I've seen this before, when things are marginal.

The log for a single node looks reasonably clean, apart from obvious missing frames. The bigger log is harder to parse at this point.

What I would do if I was debugging this:
1. Simple performance test. Two devices. One streams data as fast as possible, without ACKs, another counts received frames and prints the count per second. You need to compare the number of sent and received frames.
2. Set default PHY settings and slowly ramp up the node count. Don't just enable all 21 of them, you will never figure out what's wrong. Start with 2-3 and increase one by one. Use WSNDemo for this and set reporting period equal to the number of nodes in seconds. So 5 nodes will report every 5 seconds. In this case logs should be absolutely clean of missing and damaged frames if hardware is working fine.
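
A rough sketch of the receive-side bookkeeping for the first test, assuming the application gets a callback per delivered frame and a one-second periodic timer. The type and function names are hypothetical and the hooks into the stack are left out.

```c
#include <stdint.h>

typedef struct {
    uint32_t current;  /* frames counted in the running one-second window */
    uint32_t last;     /* total for the most recently completed window */
} rx_counter_t;

/* Call from the receive callback for every frame delivered to the app. */
static void rx_counter_on_frame(rx_counter_t *c)
{
    c->current++;
}

/* Call from a one-second periodic timer; returns frames seen in that second. */
static uint32_t rx_counter_on_tick(rx_counter_t *c)
{
    c->last = c->current;
    c->current = 0u;
    return c->last;
}

/* Loss in parts per thousand, given how many frames the sender reported. */
static uint32_t loss_permille(uint32_t sent, uint32_t received)
{
    if (sent == 0u || received >= sent) {
        return 0u;
    }
    return ((sent - received) * 1000u) / sent;
}
```

Printing the value returned by `rx_counter_on_tick()` once per second and comparing it with the sender's own count gives the sent versus received comparison described in step 1.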

Nick Novak · « **Reply #12 on:** November 14, 2016, 10:19:32 pm »

So a brief update.

I’ve done a quick evaluation of my hardware and running the FCC test application on my boards VS the SAMR-xplained I see the same envelope but down significantly, about 20dB. That’s pretty bad so I’m going to see what I can do with my 0201 matching network however I don’t hold out a lot of hope for matching alone, I think it’s just destined to be a poor radiator, and that’s ok I only need 10 – 15m range. The good news is I don’t see any spurs or weirdness in the frequency domain. When compared to the xplained board the envelope is just attenuated so I’m going to declare my RF section functional, but sub optimal.

I ran my code on the SAMR-xplained and added it to my sensor network, and was able to isolate an example where, according to Wireshark, a packet was not ACKed by the xplained board but was clearly received because the xplained board replied. So I think the sniffer just misses the odd frame. This is a pretty great tool for the price so I’m not upset about that; it is wireless after all, and it's probably coping with lots of other in-band crap at 2.4G (possibly originating from the computer it's plugged into). I’m not going to dwell on single occurrences of this if network performance is there.

I’ve also revived my proof of concept code which was written probably 2 years ago and based on a hacked up version of the WSN demo (hijacking the data payload and adding a couple end points and other messages) and got it running on my current boards. My intent was to accomplish your two debugging recommendations Alex with minimal effort.

This POC code works significantly better. With 2 sensor nodes pushing 25 packets per second each and the coordinator sending 20 broadcasts per second my network is alive and well, packets are dropping (no RET) but really not many (sorry I can’t quantify that my assessment was based on the fluid flow of a strip chart data plot). Any way much better than before.
Ramping up to 20 nodes I find (with no staggered startup) power on route discovery and network commissioning takes maybe 3 seconds and the all nodes transmitting at once race condition is resolved with almost no perceptible delay.

As I feared, my problem appears to be between the chair and the keyboard.

I have my solution here I just need to find it. When I do I’ll post it, in case anyone else has made the same bonehead move as I have.

Nick

Nick Novak · « **Reply #13 on:** November 14, 2016, 10:25:03 pm »

Follow on question:
When running my POC code I see lots of packets with MIC failure in Wireshark. I don't see this with my current (and broken) code. What is this indicating? I thought MIC was a CRC added by the PHY.

ataradov · « **Reply #14 on:** November 15, 2016, 05:14:37 pm »

Quote from: Nick Novak on November 14, 2016, 10:25:03 pm

When running my POC code I see lots of packets with MIC failure in Wireshark. I don't see this with my current (and broken) code. What is this indicating? I thought MIC was a CRC added by the PHY.

MIC is the Message Integrity Code added to the secured frames, so it is likely that you don't use the same security key and one of them is not set up in Wireshark.

Nick Novak · « **Reply #15 on:** April 01, 2017, 01:33:37 pm »

So I meant to chime back in on this a while ago as I promised to post my resolution. In solving the mystery of why my proof of concept worked better than what I was developing from the ground up I realized that I was reusing code that compressed all of its signal processing into the largest possible bursts and was blocking the network stack for ~25 - 30ms at a time. It was written for a battery powered product so did all its processing in bursts and slept the rest. I rearranged it to process data in the smallest chunks possible as it arrives and the network situation improved significantly. So lesson learned - keep latency low or you will end up with many unnecessary RETs.
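
To illustrate the rearrangement, here is a sketch of processing in small chunks so that control returns to the main loop between chunks. The job structure, the chunk size, and the sum-of-squares stand-in for the real signal processing are all assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SAMPLES 16u  /* assumed chunk size; keep one chunk short */

typedef struct {
    const int16_t *samples;
    size_t count;
    size_t pos;      /* next unprocessed sample */
    int64_t energy;  /* running result (sum of squares as a stand-in) */
} dsp_job_t;

/* Process at most one chunk and return true while work remains, so the
 * main loop can service the network stack between chunks. */
static bool dsp_job_step(dsp_job_t *job)
{
    size_t end = job->pos + CHUNK_SAMPLES;
    if (end > job->count) {
        end = job->count;
    }
    for (; job->pos < end; job->pos++) {
        int32_t s = job->samples[job->pos];
        job->energy += (int64_t)s * s;
    }
    return job->pos < job->count;
}
```

In a superloop this would be called once per pass, alternating with the stack's task handler (`SYS_TaskHandler()` in LwMesh), instead of running the whole buffer in one 25 - 30ms burst.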

Since that time I've been making incremental gains and losses in performance but have still struggled in implementing a broadcast based firmware update. I've thought my way through the mechanics of it and it works well on the bench but falls on its face in the field. My field test consists of 40 nodes, each separated by about 6m. Anyway the transfer degenerates into a storm of broadcasts and general nonsense. My application code eventually recovers itself by intervening and essentially resetting things. Recently I've spent some time understanding this nonsense broadcast storm and I'm finding what appears to be multiple in flight broadcast packets endlessly circulating. Looking at the duplicate rejection logic this seems very possible. To prove my theory I powered down the node indicated in my Wireshark trace as the network source address and found the broadcasts continue at a very high rate, high enough that they tend to crash the sniffer interface. The broadcasts went on for hours like this and all with one of three sequence numbers. Two days ago I changed a couple lines in the duplicate rejection function to keep all sequence numbers until they expire and performance has vastly improved. I haven't opened the champagne yet but it looks like I've programmed those field nodes manually for the last time. (You need to climb a ladder to get to each of them, with a laptop in your hand, and it's still snowy here)

Anyway I hoped to get your opinion on that last part Alex. I'll post what I changed when I'm back on my work computer but it's pretty simple.

Nick

ataradov · « **Reply #16 on:** April 01, 2017, 04:57:44 pm »

Can you describe your change in more detail? Because that's how the DR table should work right now. The only way for an entry to be removed is to expire.

Nick Novak · « **Reply #17 on:** April 01, 2017, 06:40:38 pm »

Truth be told Alex I don't quite understand what you're doing with the diff and mask and bitshifts. Stepping through the packet processing in one of my broadcast storms though I find I'm hitting the else case where the duplicate reject table entry is being updated with the new sequence number over and over, essentially flipping between the same two numbers. If I happen to get two back to back with the same sequence number the second gets rejected as it should but when they arrive in an alternating order they get rebroadcast infinitely. After seeing this happen I lobotomized the function as a test and I've had no reoccurrence, no overflows of my DR table and have run a couple firmware updates successfully.

Here is what I've replaced the DR logic with:

static bool nwkRxRejectDuplicate(NwkFrameHeader_t *header)
{
   NwkDuplicateRejectionEntry_t *entry;
   NwkDuplicateRejectionEntry_t *freeEntry = NULL;

   for (uint8_t i = 0; i < NWK_DUPLICATE_REJECTION_TABLE_SIZE; i++) {
      entry = &nwkRxDuplicateRejectionTable;

      if (0 == entry->ttl) {
         freeEntry = entry;
         continue;
      }

      if (header->nwkSrcAddr == entry->src) { //network source match

         if(entry->seq == header->nwkSeq){ //network sequence match - reject
            entry->ttl = DUPLICATE_REJECTION_TTL;
            return true;
         }
         //no else, keep looking and make a new entry at the end
      }
   }//end table index loop

   if (NULL == freeEntry) {   //no free entries - reject
      return true;
   }

   freeEntry->src = header->nwkSrcAddr;
   freeEntry->seq = header->nwkSeq;
   freeEntry->ttl = DUPLICATE_REJECTION_TTL;

   SYS_TimerStart(&nwkRxDuplicateRejectionTimer);

   return false;
}
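
For anyone who wants to experiment with this logic off-target, here is a self-contained sketch of the same approach with the table indexed by the loop variable. The types, table size, and TTL are stand-ins rather than the LwMesh definitions, and the timer start is replaced by an explicit tick function.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DR_TABLE_SIZE 8u   /* stand-in size for experiments */
#define DR_TTL        50u  /* stand-in TTL in timer ticks */

typedef struct {
    uint16_t src;
    uint8_t  seq;
    uint8_t  ttl;  /* 0 means the slot is free */
} dr_entry_t;

static dr_entry_t dr_table[DR_TABLE_SIZE];

/* Returns true when the frame should be dropped: either an exact
 * (source, sequence) match is still live, or the table is full. */
static bool dr_reject(uint16_t src, uint8_t seq)
{
    dr_entry_t *free_entry = NULL;

    for (size_t i = 0; i < DR_TABLE_SIZE; i++) {
        dr_entry_t *entry = &dr_table[i];

        if (entry->ttl == 0u) {  /* expired or never used */
            free_entry = entry;
            continue;
        }
        if (entry->src == src && entry->seq == seq) {
            entry->ttl = DR_TTL;  /* refresh the entry, then reject */
            return true;
        }
    }
    if (free_entry == NULL) {
        return true;  /* no room to track it, so reject */
    }
    free_entry->src = src;
    free_entry->seq = seq;
    free_entry->ttl = DR_TTL;
    return false;
}

/* Periodic tick: age every live entry by one. */
static void dr_tick(void)
{
    for (size_t i = 0; i < DR_TABLE_SIZE; i++) {
        if (dr_table[i].ttl > 0u) {
            dr_table[i].ttl--;
        }
    }
}
```

Feeding it the alternating pattern from the broadcast storm (sequence 1, 2, 1, 2 from one source) accepts the first two frames and rejects the repeats, because each sequence number keeps its own entry until it expires.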

ataradov · « **Reply #18 on:** April 01, 2017, 07:01:18 pm »

The idea behind the mask is to allow up to 8 frames with sequential sequence numbers to be tracked. It is not the most beautiful solution, but it was necessary, since it is possible for two sequential frames from the same node to arrive in a different order. This happens, for example, when security is used for one frame, but not the other, and extra security processing time introduces a delay in arrival. A similar thing can happen due to routing as well.
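
The masking idea can be illustrated with a sliding window per source: the newest sequence number plus a bitmask of the 8 most recent ones. The sketch below is written from the description above and is not the LwMesh source; the names and the choice to reset the window on a far-off sequence number are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t newest;  /* highest sequence number seen so far (modulo 256) */
    uint8_t mask;    /* bit k set: sequence (newest - k) was seen */
} seq_window_t;

/* Returns true if seq was already seen within the 8-frame window.
 * A zero-initialised window means nothing has been seen yet. */
static bool seq_window_is_duplicate(seq_window_t *w, uint8_t seq)
{
    uint8_t behind = (uint8_t)(w->newest - seq);
    uint8_t ahead  = (uint8_t)(seq - w->newest);

    if (behind < 8u) {  /* inside the window, including seq == newest */
        uint8_t bit = (uint8_t)(1u << behind);
        if ((w->mask & bit) != 0u) {
            return true;
        }
        w->mask |= bit;  /* late arrival: mark it and accept */
        return false;
    }

    /* Newer than anything seen: slide the window forward. A jump of 8 or
     * more (or a number far behind) resets the window in this sketch. */
    w->mask = (ahead < 8u) ? (uint8_t)(w->mask << ahead) : 0u;
    w->mask |= 1u;
    w->newest = seq;
    return false;
}
```

With this layout, frames 4 then 3 arriving out of order are both accepted once, and a frame that alternates with its neighbour (1, 2, 1, 2) is rejected on the repeats because both bits stay set in the single per-source entry.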

My solution has only one entry per source address, yours creates as many entries as there are ongoing transfers (broadcast or not).

I will need to spend some time with the original code to figure out if there is some mistake there. The idea is solid, it was used in many cases, but the implementation may have issues.

Your solution is probably sufficient for most practical cases. I made mine in the dark, before I could actually look at practical examples, so I assumed the worst.

And your solution does not care about sequence numbers incrementing, while mine does. I can't tell if it is good or bad off the top of my head.

Anyway, that's exactly the point of LwMesh - take it as a base and hack for your specific use case.

friendly_giant · « **Reply #19 on:** April 04, 2017, 09:18:20 pm »

I came across this post and was interested as I am experiencing similar issues with broadcast storms.

In looking to implement a similar solution I noticed in your code that you are not accessing different entries in the duplicateRejectionTable. You have here:
entry = &nwkRxDuplicateRejectionTable;

Is there a reason you are doing this or should it be:
entry = &nwkRxDuplicateRejectionTable[i];

Hope to have similar success in solving this as you have!

ataradov · « **Reply #20 on:** April 04, 2017, 09:20:24 pm »

Quote from: friendly_giant on April 04, 2017, 09:18:20 pm

I came across this post and was interested as I am experiencing similar issues with broadcast storms.

In looking to implement a similar solution I noticed in your code that you are not accessing different entries in the duplicateRejectionTable. You have here:
entry = &nwkRxDuplicateRejectionTable;

Is there a reason you are doing this or should it be:
entry = &nwkRxDuplicateRejectionTable[i];

Hope to have similar success in solving this as you have!

And that's why you should use code tags:

entry = &nwkRxDuplicateRejectionTable[i];

Nick Novak · « **Reply #21 on:** June 05, 2017, 10:02:03 pm »

Hey Alex, I want to run something by you if I may.

I have a growing field sensor deployment about 60 nodes currently and for the most part working well. I am still having some difficulty with network stability though and what I notice is that nodes are not always routing things sensibly. For example a node may establish or at least hang on to a route that’s at the edge of its RF reach with crap LQI when there is a node 6 meters away with great forward LQI. Maybe the route was better when it was formed but the node isn’t going to do anything about it until it breaks and often they tend to hang on indefinitely, requiring more RETs than they should. In other cases the LQI is good but there is a senseless number of hops involved which is just adding tons of congestion.

I had some thoughts about optimizing both of these scenarios since my application relies on communication between the sensor nodes and coordinator and link local broadcasts between nodes for sensor fusion type information sharing.
The idea is this:
•   Coordinator periodically sends out a "routing beacon" broadcast w/ link local
•   Receiving nodes compute forward LQI, add 1 to the hop count, random time delay and rebroadcast their current route home w/ link local
•   The beacon message contains a sequence number so it is only resent once by each node and propagates outward from the coordinator
•   Nodes tabulate all the received beacon messages and rank the best route home by LQI and fewest hops.
•   When a new route is selected a message is sent to the next hop address telling that node to update its routing table, that node forwards the message to its next hop address and so on until it filters all the way back to the coordinator. This establishes the return route from the coordinator back to the end node.
•   Based on testing – routing beacon message now includes the entire route so that nodes can ensure they don’t form a loop. This is going to limit the possible number of hops but I don’t think this will be an issue.
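
The ranking step in the list above could look something like the sketch below. It assumes each received beacon is stored with the LQI measured on reception and the path it carried; the struct layout, the minimum-LQI threshold, and the "fewest hops first, LQI as tie-break" order are one possible choice and not necessarily the scheme used in the field.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ROUTE_HOPS 16u  /* assumed cap on the path carried in a beacon */

typedef struct {
    uint16_t sender;                /* neighbour that rebroadcast the beacon */
    uint8_t  lqi;                   /* LQI measured when it was received */
    uint8_t  hops;                  /* entries used in path[] */
    uint16_t path[MAX_ROUTE_HOPS];  /* nodes already on the sender's route */
} beacon_t;

static bool beacon_contains(const beacon_t *b, uint16_t addr)
{
    for (uint8_t i = 0; i < b->hops && i < MAX_ROUTE_HOPS; i++) {
        if (b->path[i] == addr) {
            return true;
        }
    }
    return false;
}

/* Pick a parent: skip weak links and any path that already includes us
 * (that would form a loop), prefer fewer hops, break ties on LQI.
 * Returns NULL if no candidate qualifies. */
static const beacon_t *pick_parent(const beacon_t *seen, size_t n,
                                   uint16_t self, uint8_t min_lqi)
{
    const beacon_t *best = NULL;

    for (size_t i = 0; i < n; i++) {
        const beacon_t *b = &seen[i];
        if (b->lqi < min_lqi || beacon_contains(b, self)) {
            continue;
        }
        if (best == NULL || b->hops < best->hops ||
            (b->hops == best->hops && b->lqi > best->lqi)) {
            best = b;
        }
    }
    return best;
}
```

To cut down on route churn, a node could additionally keep its current parent unless the new candidate is better by some margin; that hysteresis is not shown here.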

So I wonder if this sounds reasonable to you. It seems like I’m reinventing the route discovery function of the stack but I can’t see a better option. I see this optimization routine coexisting with the built in route discovery. On initialization I’ll rely on that and if a route breaks it will fall back on the traditional discovery but as the optimization runs through a few iterations I expect things to get sorted out nicely.

The mapping of the return route is necessary, correct? I don’t see the routing tables being updated outside of the route discovery process so I presume if I'm changing something I need to patch up the return path as well.

I’m in the midst of debugging this scheme currently and it’s taken longer than expected. It is showing promising results but currently is changing routes too frequently and generating too much traffic. I need to add some complexity to my ranking scheme and maybe play with the beacon frequency.

I welcome your comments.

Nick

Oh yeah, that last post had me going for a second. I read the post and didn’t understand your reply right away. My thought process: “Holy crap he’s right. Wait, how is that even possible”… as I’m opening the project to check “Wait, that’s totally impossible that I made that mistake” Then I understood your reply.

ataradov · « **Reply #22 on:** June 05, 2017, 10:14:35 pm »

Quote from: Nick Novak on June 05, 2017, 10:02:03 pm

For example a node may establish or at least hang on to a route that’s at the edge of its RF reach with crap LQI when there is a node 6 meters away with great forward LQI. Maybe the route was better when it was formed but the node isn’t going to do anything about it until it breaks and often they tend to hang on indefinitely, requiring more RETs than they should. In other cases the LQI is good but there is a senseless number of hops involved which is just adding tons of congestion.

Yeah, that's a downside of simple route discovery algorithms.

Quote from: Nick Novak on June 05, 2017, 10:02:03 pm

So I wonder if this sounds reasonable to you.

Absolutely! What you have invented is tree routing, and it is a well-known technique for networks with small numbers of data sinks (1 in your case).

You may actually go a bit further and implement source routing. Make each node add its address to the list of addresses. This way the end node can evaluate the entire route hop-by-hop, and pick absolutely the best route. You can then include the whole route inside the message (this reduces the useful payload, of course). This way intermediate nodes don't need to have any routing tables at all, they just take routing information from the frame itself.

A device may even remember multiple different routes and try them as a reserve, and you are guaranteed sufficiently different paths, since you see the whole history.
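
To illustrate the "no routing tables on intermediate nodes" point, here is a minimal sketch of a source-routed header and the per-hop handling. The header layout, `sr_header_t`, `sr_handle` and `SR_MAX_HOPS` are hypothetical names made up for this example, not part of LwMesh.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SR_MAX_HOPS 16  /* hypothetical cap; each entry costs 2 bytes of payload */

typedef struct {
    uint8_t  hop_count;          /* number of valid entries in hops[] */
    uint8_t  hop_index;          /* entry that must process the frame next */
    uint16_t hops[SR_MAX_HOPS];  /* first relay ... final destination */
} sr_header_t;

typedef enum { SR_DELIVER, SR_FORWARD, SR_DROP } sr_action_t;

/* Called by every node that receives a source-routed frame.
 * On SR_FORWARD, *next_hop holds the address to transmit to and the header
 * has been advanced; no local routing table is consulted. */
sr_action_t sr_handle(sr_header_t *h, uint16_t my_addr, uint16_t *next_hop)
{
    assert(h != NULL && next_hop != NULL);

    if (h->hop_count == 0 || h->hop_count > SR_MAX_HOPS ||
        h->hop_index >= h->hop_count)
        return SR_DROP;                      /* malformed header */

    if (h->hops[h->hop_index] != my_addr)
        return SR_DROP;                      /* frame was not meant for this node */

    if (h->hop_index == h->hop_count - 1)
        return SR_DELIVER;                   /* this node is the final destination */

    h->hop_index++;
    *next_hop = h->hops[h->hop_index];
    return SR_FORWARD;
}
```

The sender fills `hops[]` from whichever remembered route it wants to try and transmits to `hops[0]` with `hop_index` set to 0. Switching to a reserve route is then just a matter of filling in a different list.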

Quote from: Nick Novak on June 05, 2017, 10:02:03 pm

It seems like I’m reinventing the route discovery function of the stack but I can’t see a better option.

Nope, that's the intention of LwMesh - give you a framework for building application-specific networks.

Quote from: Nick Novak on June 05, 2017, 10:02:03 pm

I see this optimization routine coexisting with the built in route discovery.

I would not. If you have only one gateway and data only flows between the gateway and end nodes, then this algorithm is good enough on its own.

Quote from: Nick Novak on June 05, 2017, 10:02:03 pm

The mapping of the return route is necessary, correct? I don’t see the routing tables being updated outside of the route discovery process so I presume if I'm changing something I need to patch up the return path as well.

Yes, you need to maintain both directions.

In the case of source routing, the device will send the full route to the gateway, so the gateway can just use the same table in reverse.
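
As a small sketch of that last point, assuming the uplink frame records addresses in the order they were added (end node first, last relay before the gateway last), the gateway can derive the downlink hop list by reversing it. `route_reverse` is a hypothetical helper, not an LwMesh function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies the uplink route into downlink[] in reverse order.
 * Returns the number of hops written, or 0 if the route is empty or does
 * not fit. The two buffers must not overlap. */
size_t route_reverse(const uint16_t *uplink, size_t len,
                     uint16_t *downlink, size_t cap)
{
    assert(uplink != NULL && downlink != NULL);

    if (len == 0 || len > cap)
        return 0;

    for (size_t i = 0; i < len; i++)
        downlink[i] = uplink[len - 1 - i];

    return len;
}
```

For an uplink route recorded as `{ 42, 17, 9 }`, the gateway would send downlink frames along `{ 9, 17, 42 }`, so node 42 is reached through the same relays in the opposite order.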

Nick Novak · « **Reply #23 on:** June 10, 2017, 04:04:29 am »

Thanks for the assistance once again Alex and the vote of confidence. Much appreciated.

Embedding the route tree into the frame is an interesting idea but all the routing table infrastructure is built up already so I’ll probably use it. Plus the extra time on the air for all the packets.

My new / old / reinvented routing scheme is working well, currently in concert with the preexisting route discovery logic as a failover, however I plan to take your advice and replace it entirely as network congestion is a lot more predictable and manageable. I have 62 nodes at the moment spread out over about 10 acres of space and things are finally, truly, running smoothly. I pushed an OTA firmware update out this afternoon with my custom updater and all of my sensor nodes picked up all or 99% of the image during the broadcast pass. None dropped out into route discovery which typically derailed the process in the past.

Now, 38 more nodes and we can "Ship it!"

I think it will end up being pretty application specific and spread out but I’ll see about posting some source code as it’s finalized.

I’ve mentioned to my boss that my company should send you a bottle of scotch or something in appreciation as we pay lots more for worse support of many products. I’m pretty partial to Lagavulin, if you have a preference let me know. I think you can send me an email through the forum?? That’s not a promise but I think I may be able to make it happen.

Cheers,

Nick

ataradov · « **Reply #24 on:** June 10, 2017, 04:28:34 am »

Quote from: Nick Novak on June 10, 2017, 04:04:29 am

I’ve mentioned to my boss that my company should send you a bottle of scotch or something in appreciation as we pay lots more for worse support of many products. I’m pretty partial to Lagavulin, if you have a preference let me know. I think you can send me an email through the forum?? That’s not a promise but I think I may be able to make it happen.

Thanks for the offer, but I'll pass. I don't want to create some sort of conflict of interest here.

