Cosmic rays!?

jrmymllr:
Had something weird occur this morning. My home HVAC is controlled by a DIY LAN-connected thermostat. I wanted to access my thermostat remotely but didn't like Nest or any other commercial unit I could find that didn't cost half a thousand dollars, so I built my own with its own built-in web server running off a (now EOL) ARM microcontroller.

The setpoint is supposed to bump up at 6am. I'm lying in bed shortly after 6am and notice the heat isn't running. So, being lazy and not wanting to wake my wife, I grab my phone and bring up its webpage. It knows the correct time, but the reported ambient temperature is clearly from late last night. I try rebooting it from the webpage but it ignores that request.

I get out of bed and look at the actual thermostat on the wall and see it's stuck at 10:47pm and the colons aren't blinking. And yes, it thinks the temperature is warmer than it really is. Hmmm.

So I do what any engineer would do and bring up the source code for my thermostat. From its behavior, it appears the main loop quit running, but the interrupts are still functioning. I cycled power and it's been fine for the last 9 hours.

Now consider the hardware design is over 9 years old, the last firmware change was over 2 years ago, it had an uptime of almost 300 days and has had longer uptimes in the past, always resetting due to a power interruption. It's *never* done anything like this before. Hey, I overdesigned this thing and tested the $@*^ out of it before letting it run my heat and A/C! And to access it remotely, you need VPN.


Was it hit by a cosmic ray? Obviously nobody will ever know, but this seems like something rare and random like that. I'm just glad it didn't occur when away from home for days during the winter.

niconiconi:
Critical control systems sometimes use an extremely defensive programming style, with redundant logic checks, redundant state variables, software checksums on all data structures in RAM, windowed watchdogs everywhere, all unused RAM and ROM filled with trap instructions, the CPU running and rerunning periodic self-tests, etc., so that when it's hit by cosmic rays, the firmware bails out and resets as soon as possible to reduce the risk of lockup or failure.

Unfortunately, standard programming languages and tools are poorly suited to this task: there's little automation, and the code is often unreadable.
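To make the "software checksum on data structures in RAM" idea concrete, here is a minimal sketch in C. It is not taken from any real thermostat firmware: the struct, its field names, and the `state_write` / `state_is_valid` helpers are all made up for illustration, and the CRC is a plain bitwise CRC-16/CCITT-FALSE. The fields are serialized by hand so that struct padding never enters the checksum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thermostat state; names are illustrative only. */
typedef struct {
    int16_t  setpoint_tenths_c;  /* target temperature in 0.1 degC units */
    uint8_t  mode;               /* 0 = off, 1 = heat, 2 = cool */
    uint16_t crc;                /* CRC-16 over the two fields above */
} guarded_state_t;

/* Bitwise CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF. */
static uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000) {
                crc = (uint16_t)((crc << 1) ^ 0x1021);
            } else {
                crc = (uint16_t)(crc << 1);
            }
        }
    }
    return crc;
}

/* Serialize the protected fields explicitly, then checksum them. */
static uint16_t state_crc(const guarded_state_t *s)
{
    uint8_t buf[3];
    buf[0] = (uint8_t)((uint16_t)s->setpoint_tenths_c >> 8);
    buf[1] = (uint8_t)((uint16_t)s->setpoint_tenths_c & 0xFF);
    buf[2] = s->mode;
    return crc16_ccitt(buf, sizeof buf);
}

/* Every write goes through here so the checksum stays in step. */
void state_write(guarded_state_t *s, int16_t setpoint, uint8_t mode)
{
    assert(s != NULL);
    s->setpoint_tenths_c = setpoint;
    s->mode = mode;
    s->crc = state_crc(s);
}

/* Check before every use; on failure the firmware would bail out and reset. */
bool state_is_valid(const guarded_state_t *s)
{
    assert(s != NULL);
    return s->crc == state_crc(s);
}
```

The control loop would call `state_is_valid()` before acting on the setpoint and force a reset if it returns false. You can see where the readability complaint comes from: every access to every variable ends up wrapped like this.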

berke:
@jrmymllr let that teach you a lesson. Next time don't be a penny pincher, and use at least a triple-redundant majority-voting system with rad-hard CPUs for your HVAC, lest you freeze in the morning.

jrmymllr:

--- Quote from: niconiconi on January 11, 2023, 10:05:28 pm ---Critical control systems sometimes use an extremely defensive programming style, with redundant logic checks, redundant state variables, software checksums on all data structures in RAM, windowed watchdogs everywhere, all unused RAM and ROM filled with trap instructions, the CPU running and rerunning periodic self-tests, etc., so that when it's hit by cosmic rays, the firmware bails out and resets as soon as possible to reduce the risk of lockup or failure.

Unfortunately, standard programming languages and tools are poorly suited to this task: there's little automation, and the code is often unreadable.

--- End quote ---

My code doesn't go that far, but some changes will be made. Specifically, I will start using the watchdog timer, which is something I should have implemented from the start.
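Since the failure here was a stalled main loop with the interrupts still running, the watchdog must be fed only from the main loop, never from a timer ISR, or it would have kept getting fed through the whole hang. Below is a rough host-side simulation of that rule in C. Nothing in it is a real vendor API: `sim_watchdog_t`, `wdt_kick`, `wdt_tick_isr`, `main_loop_pass`, and the 3-tick timeout are hypothetical stand-ins for the actual watchdog peripheral and its registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WDT_TIMEOUT_TICKS 3u  /* hypothetical: reset after 3 ticks without a kick */

/* Software model of a hardware watchdog, for illustration only. */
typedef struct {
    uint32_t ticks_since_kick;  /* advanced by the timer interrupt */
    bool     reset_requested;   /* stands in for the MCU reset line */
} sim_watchdog_t;

/* Feed the watchdog. Called only at the end of a full main-loop pass. */
void wdt_kick(sim_watchdog_t *w)
{
    assert(w != NULL);
    w->ticks_since_kick = 0;
}

/* Models the periodic timer interrupt. It deliberately never kicks the
 * watchdog: interrupts can keep firing while the main loop is stuck. */
void wdt_tick_isr(sim_watchdog_t *w)
{
    assert(w != NULL);
    w->ticks_since_kick++;
    if (w->ticks_since_kick >= WDT_TIMEOUT_TICKS) {
        w->reset_requested = true;
    }
}

/* One pass of a hypothetical main loop. healthy == false models the hang:
 * the pass never completes, so the kick is never reached. */
void main_loop_pass(sim_watchdog_t *w, bool healthy)
{
    assert(w != NULL);
    if (!healthy) {
        return;
    }
    /* read the sensor, update the display, run the control logic ... */
    wdt_kick(w);
}
```

With a healthy loop, `reset_requested` stays false indefinitely. Once the loop stops completing passes, the tick ISR alone pushes the counter to the timeout and the reset fires within three ticks, instead of the display sitting frozen at 10:47pm all night.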

niconiconi:

--- Quote from: jrmymllr on January 11, 2023, 09:34:46 pm ---Was it hit by a cosmic ray? Obviously nobody will ever know, but this seems like something rare and random like that. I'm just glad it didn't occur when away from home for days during the winter.

--- End quote ---

Not necessarily cosmic; a burst of fast EMI transients can do the same thing...
