Author Topic: CAN bus and others, how are really garbled transmissions protected against


Offline InfravioletTopic starter

  • Super Contributor
  • ***
  • Posts: 1023
  • Country: gb
So, let's consider a CAN bus or another bus protocol running some big moving piece of equipment which can damage itself if it is given the wrong command at the wrong time.

What is typically done in this sort of scenario to protect against garbled transmissions?

Putting checksum bytes, parity, or error-correction bits into the transferred data can obviously eliminate most cases of slight corruption. BUT what if a message gets so corrupted that the received message ends up with the perfect storm of corruption in both the message and the checksum, such that it looks like a valid message again, except now the "valid" message is telling the receiving device to do something it really shouldn't do at this time?

Say a message on some arbitrary bus is sent like:
1,255,check_byte
meaning move yourself in direction 1 (fwds) at speed 255/255
but noise on the line(s) means what arrives is
0(corrupted),255,check_byte(also_corrupted)
and the corruption is the "perfect storm" where the corrupted checksum is valid for the corrupted message. Now the thing is moving in direction 0 and about to tear itself apart, because it should only move in direction 0 at half speed, and at a different time, when something else has gotten out of the way...
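To make that concrete, here is a minimal sketch (in C, with a made-up three-byte frame, not any real protocol) of how a weak one-byte XOR checksum is blind to exactly this failure: the same bit flipped in a data byte and in the checksum byte cancels out, and the corrupted frame still verifies.

```c
#include <stdio.h>
#include <stdint.h>

/* one-byte XOR "checksum" over the two data bytes */
static uint8_t xor_check(uint8_t dir, uint8_t speed) {
    return dir ^ speed;
}

int main(void) {
    /* intended frame: direction 1 (fwd), speed 255, checksum */
    uint8_t tx[3] = {1, 255, 0};
    tx[2] = xor_check(tx[0], tx[1]);

    /* noise flips bit 0 of the direction byte AND bit 0 of the
       checksum byte: the two flips cancel out in the XOR */
    uint8_t rx[3] = {(uint8_t)(tx[0] ^ 0x01), tx[1], (uint8_t)(tx[2] ^ 0x01)};

    if (xor_check(rx[0], rx[1]) == rx[2])
        printf("accepted: dir=%u speed=%u (corrupted yet 'valid')\n",
               rx[0], rx[1]);
    return 0;
}
```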

Is there a standard method used to protect against these perfect-storm corruptions? I ask about CAN bus because, being automotive, you'd expect any errors to have pretty damaging consequences, so those who make regular use of it would have good motivation to work out a way to handle those perfectly terrible errors. The same goes for communication methods used inside avionics systems, factory automation, controlling robotic arms...

Or for other buses in smaller systems, such as I2C, you could also get situations where the corruption affects the addressing, so a valid message intended for one device ends up being read by some other device on the bus, which interprets the same message in a wholly different way...

Do people developing those sorts of systems have some way to actually eliminate the possibility of a corrupted transmission looking like a valid message with a different meaning than intended? Do they just work by procedures of sending the message several times before an action is taken? Do they just tolerate this possibility but drive the chance of it down to some once-in-a-million-years type threshold by using sufficiently many bits or bytes of anti-error coding per bit or byte of actual message data?

Thanks
 

Online ataradov

  • Super Contributor
  • ***
  • Posts: 11269
  • Country: us
    • Personal site
CAN also checks signal levels outside of the sampling points. Plus, both the receive and transmit sides monitor the bus, and if the TX side detects that the line level does not represent what it wants to transmit, it stops the transmission and intentionally generates an error signal.

Additional protection is provided by the fact that nobody relies on reliable delivery, so messages are sent periodically, and the period is much smaller than a typical reaction time. So, even if by some miracle a sensor value gets damaged, it will be updated to the correct value within the next few ms. There are not a lot of components that can react in that time.
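A minimal sketch of that pattern on the receiving side (the names, the assumed ~10 ms transmit period, and the 50 ms timeout are illustrative, not from any particular standard):

```c
#include <stdint.h>
#include <stdbool.h>

#define SPEED_TIMEOUT_MS 50   /* several transmit periods */

static uint32_t last_rx_ms;
static uint16_t last_speed;

/* called from the CAN RX handler when a speed frame passes the CRC */
void on_speed_frame(uint32_t now_ms, uint16_t speed) {
    last_rx_ms = now_ms;
    last_speed = speed;   /* a garbled-then-rejected frame is simply
                             replaced by the next one ~10 ms later */
}

/* called from the control loop */
bool speed_valid(uint32_t now_ms, uint16_t *speed_out) {
    if (now_ms - last_rx_ms > SPEED_TIMEOUT_MS)
        return false;     /* stale: caller should enter a fail-safe state */
    *speed_out = last_speed;
    return true;
}
```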

I2C does something similar, although it is much less robust. When a master detects that the line level does not match what it wants to drive, it assumes a conflict and stops the transfer.
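As an illustration, a bit-banged open-drain master could implement that check per bit like this (the GPIO helpers are hypothetical):

```c
#include <stdbool.h>

extern void sda_release(void);     /* let the pull-up raise SDA */
extern void sda_drive_low(void);
extern bool sda_read(void);        /* actual level on the wire */

/* returns false on bus conflict / arbitration loss */
bool i2c_send_bit(bool bit) {
    if (bit)
        sda_release();
    else
        sda_drive_low();
    /* ... clock SCL high here ... */
    if (sda_read() != bit) {
        /* someone else is holding the line low while we expect high:
           stop the transfer instead of garbling it further */
        return false;
    }
    return true;
}
```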
« Last Edit: October 27, 2023, 05:01:35 pm by ataradov »
Alex
 

Offline Siwastaja

  • Super Contributor
  • ***
  • Posts: 8179
  • Country: fi
In the end, it is about managing probabilities. It is possible that an infinite number of monkeys will write perfect copies of all of Shakespeare's works, and so it is possible that, given infinite time, random noise will at some point generate a CAN packet which is valid (data, CRC, all) and which instructs the lathe to kill its operator. It is just unlikely enough, and this is something which is not too hard to ballpark even on a napkin. If it requires a billion machines running for a billion years, then you can maybe accept the risk.

CAN is resilient against this because it uses a pretty long CRC code compared to the packet payload length. It is practically impossible to get a garbled CAN message through the CRC check. A dangerous bug in CAN code (including the HW peripheral) is far more likely, by many orders of magnitude, as is, say, memory corruption, even with ECC.
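The napkin ballpark might look like this; the frame rate, the noise rate, and the often-quoted ~4.7e-11 residual (undetected) error fraction for the CAN CRC are all illustrative assumptions:

```c
#include <stdio.h>

int main(void) {
    double frames_per_s  = 1000.0;   /* assumed bus load */
    double corrupt_frac  = 1e-3;     /* frames hit by noise: very noisy bus */
    double residual_frac = 4.7e-11;  /* corrupted frames slipping past CRC */

    double undetected_per_s = frames_per_s * corrupt_frac * residual_frac;
    double seconds_per_year = 3600.0 * 24 * 365;

    /* roughly one undetected frame every ~675 years, per node */
    printf("one undetected frame every %.1f years\n",
           1.0 / (undetected_per_s * seconds_per_year));
    return 0;
}
```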

It is important to understand that absolutely nothing is zero-risk, and to concentrate on the highest-risk items, which would usually be, in this order: human errors in software code, then human errors in hardware design, maybe then bit errors in non-checksummed RAM and in non-checksummed simple communication interfaces (think I2C), then bit errors in parity-checked RAM or interfaces. Strongly checksummed interfaces (even with just an 8-bit CRC) are much further down the list.
« Last Edit: October 27, 2023, 06:07:27 pm by Siwastaja »
 

Offline pdenisowski

  • Frequent Contributor
  • **
  • Posts: 641
  • Country: us
  • Product Management Engineer, Rohde & Schwarz
    • Test and Measurement Fundamentals Playlist on the R&S YouTube channel

In a former life I spent a lot of time working with various communications protocols.

  • If you want to avoid errored messages that appear to be valid, you implement a robust (long) checksum.
  • If you want to avoid out-of-sequence or missing messages, you implement a sequence number (also protected by a checksum) -- see the sketch below.
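A minimal sketch of those two bullets together, with the sequence number inside the CRC-protected region (the frame layout and the CRC-16/CCITT choice are illustrative assumptions):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

struct frame {
    uint16_t seq;        /* detects loss / reordering ...            */
    uint8_t  payload[8];
    uint16_t crc;        /* ... and is itself covered by the CRC     */
};

/* bitwise CRC-16/CCITT, polynomial 0x1021, init 0xFFFF */
static uint16_t crc16_ccitt(const uint8_t *p, size_t n) {
    uint16_t crc = 0xFFFF;
    while (n--) {
        crc ^= (uint16_t)(*p++) << 8;
        for (int i = 0; i < 8; i++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

bool accept(const struct frame *f, uint16_t expected_seq) {
    if (crc16_ccitt((const uint8_t *)f, offsetof(struct frame, crc)) != f->crc)
        return false;                 /* corrupted */
    return f->seq == expected_seq;    /* else: missing or out of order */
}
```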

But you still can't - theoretically - drive down error probability to zero. 

I had a case once where a data link appeared to have almost entirely failed:  only about one in several million transmitted messages was being "properly" received.

When I put an analyzer on the link, I noticed that messages were being corrupted because the 3rd bit in every byte (including the checksum bytes) was always being set to zero during transmission due to a hardware fault. 

But every now and then, either a packet would naturally have a zero as the 3rd bit of every byte (and survive transmission), OR -- much more disturbingly -- a packet would have that bit corrupted in every message byte and the corrupted checksum would be valid for the corrupted packet (!!!)

This was a very high-speed data link (I think it was an OC-192 connection) and it took hours for this to occur, but it's anecdotal "proof" that things like this do happen in the real world.
Test and Measurement Fundamentals video series on the Rohde & Schwarz YouTube channel:  https://www.youtube.com/playlist?list=PLKxVoO5jUTlvsVtDcqrVn0ybqBVlLj2z8

Free online test and measurement fundamentals courses from Rohde & Schwarz:  https://tinyurl.com/mv7a4vb6
 

Offline zilp

  • Regular Contributor
  • *
  • Posts: 206
  • Country: de
The fundamental problem that you can't reduce the risk to zero has already been pointed out.

However, you *can* still calculate probabilities based on a noise model and properties of the redundancy mechanisms that you employ to detect (and potentially correct) transmission errors, and you can select those mechanisms to achieve a target failure rate.

As for CRCs in particular, this page can be helpful:

https://users.ece.cmu.edu/~koopman/crc/

If it really matters and you have the computational power, you can also use cryptographic hashes; that drives the probability of accidental corruption down to a level where all other sources of failure dwarf the remaining risk. If you add a secret to make it a MAC, you can even protect against intentional interference, which becomes important when you have an untrusted communication medium (radio, shared networks, internet, ...).
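A sketch of the MAC idea (hmac_sha256() here is a hypothetical stand-in for whatever crypto library you actually use, and the 8-byte tag truncation is just one possible trade-off):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define TAG_LEN 8   /* truncated tag appended to each message */

/* hypothetical helper: writes the 32-byte HMAC-SHA256 of msg into out */
extern void hmac_sha256(const uint8_t key[32],
                        const uint8_t *msg, size_t len,
                        uint8_t out[32]);

bool verify(const uint8_t key[32],
            const uint8_t *msg, size_t len,
            const uint8_t tag[TAG_LEN]) {
    uint8_t full[32];
    hmac_sha256(key, msg, len, full);

    /* constant-time compare: don't leak how many bytes matched */
    uint8_t diff = 0;
    for (size_t i = 0; i < TAG_LEN; i++)
        diff |= full[i] ^ tag[i];
    return diff == 0;
}
```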

Also, mind that you have to protect everything that is semantically relevant, so stuff like message-sequence completeness might be just as important as message integrity.
 

Offline pickle9000

  • Super Contributor
  • ***
  • Posts: 2439
  • Country: ca
There are also life-critical situations: flood gates on a dam, flight control systems. In those systems they often use redundant communication lines, redundant computers, and even multiple programming teams to reduce errors.
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 3727
  • Country: us
A long checksum can make the probability of a random error being accepted arbitrarily low: for instance, a 32-bit checksum on random data will only be accidentally correct one in 4 billion times.

Another part is choosing the right checksum for your error model. Some checksums are weak against common errors such as transpositions, repetition, or extra zero words at the end. Some codes have error-correcting capability: that's important if losing a message can also be dangerous.

You also have to consider what happens if your link goes down completely. Your checksums will hopefully ensure that line noise isn't misinterpreted as data, but losing messages can also be dangerous if the message is "emergency stop". That's another reason CAN messages are often repeated constantly. If a safety-critical receiver doesn't receive a message within a certain amount of time, it can go into a fail-safe mode.

Also, pick your command set carefully. Don't have a command to toggle some state (like the power button on a TV), because if you lose track of the current state you will do the wrong thing. Try to have messages declare the desired state as completely as practical. The goal is for a message not to depend on the previous message having been executed properly.
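A sketch of the difference (the message layouts and field names are made up for illustration):

```c
#include <stdint.h>

/* Fragile: the meaning depends on state the receiver may have lost,
   and a duplicated or dropped frame silently inverts the outcome. */
struct toggle_cmd {
    uint8_t channel;     /* "flip output N" */
};

/* Robust: a repeated or replayed frame is harmless, and a missed one
   is corrected by the next periodic transmission. */
struct state_cmd {
    uint8_t direction;   /* 0 = reverse, 1 = forward  */
    uint8_t speed;       /* absolute setpoint, 0..255 */
    uint8_t enable;      /* explicit state, not "toggle" */
};
```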
 

