Author Topic: Cosmic rays!?  (Read 1241 times)

Offline jrmymllrTopic starter

  • Regular Contributor
  • *
  • Posts: 153
  • Country: us
Cosmic rays!?
« on: January 11, 2023, 09:34:46 pm »
Had something weird occur this morning. My home HVAC is controlled by a DIY LAN-connected thermostat. I wanted to access my thermostat remotely but didn't like Nest or any other commercial unit I could find that didn't cost half a thousand dollars, so I built my own with its own built-in web server running off a (now EOL) ARM microcontroller.

The setpoint is supposed to bump up at 6am. I'm lying in bed shortly after 6am and notice the heat isn't running. So, being lazy and not wanting to wake my wife, I grab my phone and bring up its webpage. It knows the correct time, but the reported ambient temperature is clearly from late last night. I try rebooting it from the webpage but it ignores that request.

I get out of bed and look at the actual thermostat on the wall and see it's stuck at 10:47pm and the colons aren't blinking. And yes, it thinks the temperature is warmer than it really is. Hmmm.

So I do what any engineer would do and bring up the source code for my thermostat. From its behavior, it appears the main loop quit running, but the interrupts are still functioning. I cycled power and it's been fine for the last 9 hours.

Now consider the hardware design is over 9 years old, the last firmware change was over 2 years ago, it had an uptime of almost 300 days and has had longer uptimes in the past, always resetting due to a power interruption. It's *never* done anything like this before. Hey, I overdesigned this thing and tested the $@*^ out of it before letting it run my heat and A/C! And to access it remotely, you need VPN.


Was it hit by a cosmic ray? Obviously nobody will ever know, but this seems like something rare and random like that. I'm just glad it didn't occur when away from home for days during the winter.
 

Offline niconiconi

  • Frequent Contributor
  • **
  • Posts: 366
  • Country: cn
Re: Cosmic rays!?
« Reply #1 on: January 11, 2023, 10:05:28 pm »
Critical control systems sometimes use an extremely defensive programming style: redundant logic checks, redundant state variables, software checksums on all data structures in RAM, windowed watchdogs everywhere, all unused RAM and ROM filled with trap instructions, periodic CPU self-tests, etc., so that when the system is hit by cosmic rays, the firmware bails out and resets as soon as possible to reduce the risk of lockup or failure.

Unfortunately, standard programming languages and tools are poorly suited to this task; there's little automation, and the code is often unreadable.
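As a rough illustration of the "redundant state variables" idea, here is a minimal C sketch that stores a value together with its bitwise complement and refuses to return it if the two copies disagree. The names (`guarded_i16`, `guarded_set`, `guarded_get`) and the tenths-of-a-degree setpoint are made up for this example, not taken from any real firmware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guarded variable: the value plus its bitwise complement.
 * A single flipped bit in either copy makes the pair inconsistent. */
typedef struct {
    int16_t  value;     /* e.g. a setpoint in tenths of a degree (assumed unit) */
    uint16_t inverted;  /* bitwise complement of value */
} guarded_i16;

/* Store the value and its complement together. */
static void guarded_set(guarded_i16 *g, int16_t v)
{
    g->value = v;
    g->inverted = (uint16_t)~(uint16_t)v;
}

/* Returns true and writes *out when both copies agree.
 * Returns false when the pair is inconsistent, i.e. corrupted. */
static bool guarded_get(const guarded_i16 *g, int16_t *out)
{
    if ((uint16_t)g->value != (uint16_t)~g->inverted) {
        return false;
    }
    *out = g->value;
    return true;
}
```

On a `false` return the caller could fall back to a safe default or force a reset. This only detects corruption; it does not correct it, and it says nothing about corrupted code or registers.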
« Last Edit: January 12, 2023, 07:46:50 pm by niconiconi »
 
The following users thanked this post: SeanB, tooki

Offline berke

  • Frequent Contributor
  • **
  • Posts: 259
  • Country: fr
  • F4WCO
Re: Cosmic rays!?
« Reply #2 on: January 11, 2023, 10:14:58 pm »
@jrmymllr let that teach you a lesson.  Next time don't be a penny pincher and use at least a triple redundant majority voting system with rad-hard CPUs for your HVAC lest you freeze in the morning.
 

Offline jrmymllrTopic starter

  • Regular Contributor
  • *
  • Posts: 153
  • Country: us
Re: Cosmic rays!?
« Reply #3 on: January 11, 2023, 10:15:55 pm »
Critical control systems sometimes use an extremely defensive programming style: redundant logic checks, redundant state variables, software checksums on all data structures in RAM, windowed watchdogs everywhere, all unused RAM and ROM filled with trap instructions, periodic CPU self-tests, etc., so that when the system is hit by cosmic rays, the firmware bails out and resets as soon as possible to reduce the risk of lockup or failure.

Unfortunately, standard programming languages and tools are poorly suited to this task; there's little automation, and the code is often unreadable.

My code doesn't go that far, but some changes will be made. Specifically, I will start using the watchdog timer which is something I should have implemented at the start.
 
The following users thanked this post: niconiconi

Offline niconiconi

  • Frequent Contributor
  • **
  • Posts: 366
  • Country: cn
Re: Cosmic rays!?
« Reply #4 on: January 11, 2023, 10:16:17 pm »
Was it hit by a cosmic ray? Obviously nobody will ever know, but this seems like something rare and random like that. I'm just glad it didn't occur when away from home for days during the winter.

Not necessarily cosmic, a burst of fast EMI transients can do the same thing...
 

Offline jrmymllrTopic starter

  • Regular Contributor
  • *
  • Posts: 153
  • Country: us
Re: Cosmic rays!?
« Reply #5 on: January 11, 2023, 10:19:36 pm »
@jrmymllr let that teach you a lesson.  Next time don't be a penny pincher and use at least a triple redundant majority voting system with rad-hard CPUs for your HVAC lest you freeze in the morning.

Ha ha. I'd be far more worried if the heat didn't run in the dead of winter while we're away from home for days at a time. That could cause some real damage with the low temperatures here.

Or the heat gets stuck on.
 

Online jpanhalt

  • Super Contributor
  • ***
  • Posts: 4005
  • Country: us
Re: Cosmic rays!?
« Reply #6 on: January 11, 2023, 10:25:35 pm »
Does your code have any blocking routines?  Do you have a watchdog timer reset or the equivalent?
 

Offline jrmymllrTopic starter

  • Regular Contributor
  • *
  • Posts: 153
  • Country: us
Re: Cosmic rays!?
« Reply #7 on: January 11, 2023, 11:10:56 pm »
Does your code have any blocking routines?  Do you have a watchdog timer reset or the equivalent?

I don't have it in front of me right now but in general, no blocking routines. And no, I'm not using the WDT, but I soon will be. That would have helped in this particular situation.
 

Online Kleinstein

  • Super Contributor
  • ***
  • Posts: 15157
  • Country: de
Re: Cosmic rays!?
« Reply #8 on: January 12, 2023, 10:39:41 am »
It could be a hardware side effect like radiation (could be cosmic or an alpha decay inside the chip - normal gamma from outside should not have an effect), EMI, a brownout from the grid or similar, but there are also software bugs that can cause problems under rare conditions (e.g. memory leaks under rare conditions that add up, Y2K-like problems, interference with interrupts coming up just at the wrong time). Modern hardware is normally quite reliable and normal chips are not likely to fail from radiation. Depending on the location there is a chance of a strong EMI effect from lightning.

The WDT is made for exactly such glitches, which may happen very infrequently and are thus hard to trace and exclude.
 

Offline SeanB

  • Super Contributor
  • ***
  • Posts: 16385
  • Country: za
Re: Cosmic rays!?
« Reply #9 on: January 12, 2023, 10:55:30 am »
Also remember that the WDT should be serviced from the main loop, not from an interrupt that might keep working even when the rest of the code has stopped. The kick needs to be in the main routine, so that anything that sends the processor off into the twilight zone stops the WDT from being reset, and the WDT then forces a restart. It's also probably good to add a power-on check to the main routine, to see whether the reset came from a mains failure, a normal power-on, or the WDT. One way is to keep a set of memory locations holding a predefined magic value and check them at startup: a power-on reset leaves them at something other than the magic value, while a WDT reset leaves them set to the magic value. Then write a log or increment a software counter on the display (normally zero, incremented by every WDT reset), and write the magic values to RAM so they are set either way. Writing to the display is the least damaging, as log writes use up flash cycles, unless you also have a replaceable EEPROM that you write to over I2C, with a separate device at a different address that handles volatile settings, or a battery-backed RTC whose built-in SRAM can hold the counter.
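A minimal C sketch of the magic-value check described here, written so it can be run on a PC. The `boot_record` struct, the `BOOT_MAGIC` value and the function name are hypothetical. On a real target the record would have to live in RAM that the startup code does not zero (how to arrange that depends on the toolchain and linker script, so it is left out here).

```c
#include <stdint.h>

#define BOOT_MAGIC UINT32_C(0xC0DEFACE)   /* arbitrary pattern, chosen for this example */

typedef enum { RESET_POWER_ON, RESET_WARM } reset_kind;

/* Hypothetical record kept in RAM that survives a warm reset.
 * After a real power loss its contents are effectively random. */
typedef struct {
    uint32_t magic;        /* BOOT_MAGIC when the record is valid */
    uint32_t magic_inv;    /* complement of BOOT_MAGIC, as a second check */
    uint32_t warm_resets;  /* number of resets seen without power loss */
} boot_record;

/* Call once at startup, before anything else touches the record. */
static reset_kind boot_check(boot_record *r)
{
    if (r->magic == BOOT_MAGIC && r->magic_inv == (uint32_t)~BOOT_MAGIC) {
        /* Magic survived: RAM kept its contents, so this was not a power-on. */
        r->warm_resets++;
        return RESET_WARM;
    }
    /* Magic missing: treat as power-on and initialise the record. */
    r->magic = BOOT_MAGIC;
    r->magic_inv = (uint32_t)~BOOT_MAGIC;
    r->warm_resets = 0;
    return RESET_POWER_ON;
}
```

The counter could then be shown on the display instead of being logged to flash. Note that a surviving magic value only proves the RAM was not lost; it cannot by itself tell a WDT reset from a reset-pin or software reset. Many microcontrollers also have a reset-cause register that can be read for that.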
 

Offline jrmymllrTopic starter

  • Regular Contributor
  • *
  • Posts: 153
  • Country: us
Re: Cosmic rays!?
« Reply #10 on: January 12, 2023, 06:07:07 pm »
Also remember that the WDT should be serviced from the main loop, not from an interrupt that might keep working even when the rest of the code has stopped. The kick needs to be in the main routine, so that anything that sends the processor off into the twilight zone stops the WDT from being reset, and the WDT then forces a restart. It's also probably good to add a power-on check to the main routine, to see whether the reset came from a mains failure, a normal power-on, or the WDT. One way is to keep a set of memory locations holding a predefined magic value and check them at startup: a power-on reset leaves them at something other than the magic value, while a WDT reset leaves them set to the magic value. Then write a log or increment a software counter on the display (normally zero, incremented by every WDT reset), and write the magic values to RAM so they are set either way. Writing to the display is the least damaging, as log writes use up flash cycles, unless you also have a replaceable EEPROM that you write to over I2C, with a separate device at a different address that handles volatile settings, or a battery-backed RTC whose built-in SRAM can hold the counter.

Yes absolutely the WDT needs to get reset in the main loop, especially since it was the main loop that got trashed. Not to say it'll always happen like that.
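For what it's worth, here is a rough C sketch of one common way to do this, not the actual thermostat code: each piece of main-loop work sets a check-in flag, and the watchdog is only kicked from the main loop once every flag is set. `wdt_kick()` is a stand-in for whatever register write the real part needs, and the task names are invented.

```c
#include <stdint.h>

/* Hypothetical check-in flags, one per piece of work the main loop must do. */
enum {
    TASK_SENSOR  = 1u << 0,
    TASK_CONTROL = 1u << 1,
    TASK_ALL     = TASK_SENSOR | TASK_CONTROL
};

static uint32_t checkins;    /* flags set since the last kick */
static unsigned wdt_kicks;   /* stand-in for the hardware: counts kicks */

/* Placeholder: on real hardware this would write the watchdog reload register. */
static void wdt_kick(void)
{
    wdt_kicks++;
}

/* Each task calls this when it has finished one healthy pass. */
static void task_checkin(uint32_t id)
{
    checkins |= id;
}

/* Call once per main-loop pass, never from an interrupt handler.
 * The watchdog is kicked only if every task has checked in. */
static void main_loop_service_watchdog(void)
{
    if ((checkins & TASK_ALL) == TASK_ALL) {
        wdt_kick();
        checkins = 0;
    }
}
```

If the main loop hangs, or one task stops checking in, the kicks stop and the watchdog resets the part, even while interrupts keep firing. The timeout has to be longer than the slowest legitimate loop pass.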

I think just setting up the WDT will be sufficient. I should have done that from the start and I'm not sure why I didn't. There is no EEPROM or battery-backed RAM, just crappy Luminary/TI flash with a 100-write endurance that I'm already wear leveling and, even at that, rarely writing to. It's taken this long for something to occur where the WDT would have helped :)
 

Offline AlbertL

  • Regular Contributor
  • *
  • Posts: 219
  • Country: us
Re: Cosmic rays!?
« Reply #11 on: January 13, 2023, 02:12:32 pm »
I have a home-made home automation system built around a PLC.  When I added temperature sensors to the system, I noticed that the readings from the three indoor sensors (100 ohm platinum RTDs) instantaneously dropped by about 4 degrees F every day at exactly 6:00 AM, and began continuously fluctuating randomly in a range of several degrees F.  This continued until exactly 11:00 PM, when the readings immediately became accurate and stable, and remained that way until 6:00 AM.

There's nothing in my house that operates on that schedule, and I'm in a residential neighborhood with no industrial facilities or other potential sources of power line noise.  A power line filter didn't help.  Then I remembered that there's a 5kW AM broadcast station about 3/4 mile from my house.  I checked their schedule, and sure enough, their sign-on and sign-off times matched the start and stop times of the unstable readings.  After some experimenting, I found that the RF was getting into the PLC through its serial ports.  One of these is used for programming, and for convenience I'd been leaving it connected to my laptop all the time.  The other is used for an RS-485 Modbus RTU interface to other devices.  I unplugged the cable from the programming port and put an optical isolator on the Modbus port - problem solved!         
 
The following users thanked this post: jrmymllr

