Author Topic: watchdog handling in Multiprocessor environment  (Read 1954 times)


Offline rakeshm55 (Topic starter)

  • Regular Contributor
  • *
  • Posts: 207
watchdog handling in Multiprocessor environment
« on: March 28, 2018, 05:59:24 am »
Hi,
I have 3 different processors running in a single system:
two non-OS processors and one running Linux.

In this scenario, how should the watchdog be handled? Should I run the watchdogs available in each processor independently and wire-OR their outputs into a single reset?
Or should I use one single watchdog, with one processor acting as master that expects an "I am OK" message from the others and resets the system if one of them fails to report?
Is there any literature on how to handle watchdogs in a multiprocessor system?
How do professional systems handle it?
 

Offline BravoV

  • Super Contributor
  • ***
  • Posts: 7547
  • Country: 00
  • +++ ATH1
Re: watchdog handling in Multiprocessor environment
« Reply #1 on: March 28, 2018, 06:16:37 am »
It depends; I don't think you can generalize, and IMO it is case by case. I'm not an expert, but think about it logically: as you said, it's a single system.

As a human analogy: say one of the non-OS processors is like the heart, which just keeps doing its job, and the one running Linux is like the brain.

Are you saying that a physical bump or knock to the head (a watchdog event) should make the heart stop? Is that design suitable for your system?  :P

CMIIW
« Last Edit: March 28, 2018, 06:37:27 am by BravoV »
 
The following users thanked this post: rakeshm55

Offline David Chamberlain

  • Regular Contributor
  • *
  • Posts: 249
Re: watchdog handling in Multiprocessor environment
« Reply #2 on: March 28, 2018, 06:53:27 am »
The "Microcontrollers & FPGAs" sub on this forum has dozens of people (us included) who can expand on this, so consider moving the post there.

But as you may know, Linux has watchdog support as part of the kernel.
https://linux.die.net/man/8/watchdog

Apparently (I've never used it myself) you can write your own process to kick this watchdog; otherwise the watchdog daemon linked above will do it for you.

So given you are on Linux, I would approach it like this. Create a separate watchdog process, let's call it the Governor, that is responsible for kicking (resetting) the watchdog timer and for monitoring your other running processes.

How you monitor the other running processes from the Governor could be as simple as checking the last-modified time on files that each of your processes is required to write to, or more complex, with some socket protocol to each process to ascertain its current health.

The Governor can then do other things aside from a hard system reset. For instance, it could be more graceful: restart the dead service first, and reboot only if that fails.

This keeps your watchdogging completely separate from your other applications.
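
A rough sketch of the file-timestamp variant, in Python for brevity. This is only an illustration under some assumptions: the function names, the heartbeat file paths and the timing values are made up, and it assumes the Governor itself opens the kernel watchdog device (/dev/watchdog), which only works if nothing else, such as the watchdog daemon, already holds it.

```python
import os
import time


def stale_heartbeats(paths, max_age_s, now=None):
    """Return the heartbeat files that are missing or older than max_age_s."""
    now = time.time() if now is None else now
    stale = []
    for path in paths:
        try:
            age = now - os.stat(path).st_mtime
        except FileNotFoundError:
            stale.append(path)  # a process that never wrote its file counts as dead
            continue
        if age > max_age_s:
            stale.append(path)
    return stale


def governor_loop(paths, max_age_s, watchdog_dev="/dev/watchdog", period_s=1.0):
    """Kick the kernel watchdog only while every monitored process looks alive."""
    with open(watchdog_dev, "wb", buffering=0) as wd:
        while True:
            if not stale_heartbeats(paths, max_age_s):
                wd.write(b"\0")  # any write to the device restarts the timer
            # Otherwise stop kicking and let the watchdog expire and reset the box.
            time.sleep(period_s)
```

Each monitored process would touch its own file (say a hypothetical /run/myapp/sensor.hb) at least once per max_age_s. A gentler Governor would try restarting the stale service before it stops kicking the watchdog.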

What is the wired OR? I think the Linux watchdog runs on a system timer, as it would in any bare-bones MCU application, so it should work the same way if you are worried about the OS locking up.

[Edit] What do you mean by "two non-OS processors"? Are they running off the MCU/Linux?
« Last Edit: March 28, 2018, 07:10:24 am by David Chamberlain »
 
The following users thanked this post: rakeshm55

Offline Ian.M

  • Super Contributor
  • ***
  • Posts: 12860
Re: watchdog handling in Multiprocessor environment
« Reply #3 on: March 28, 2018, 07:47:16 am »
I read that as a SoC or high-end MCU running Linux, with two slave MCUs running non-OS, application-specific firmware.  The reference to wire-ORing indicates the existence of a physical, externally accessible reset signal for each processor.

Whether or not all processors should reset together if any individual watchdog triggers is highly dependent on the application. 

If one of the slave MCUs fails, it's probably preferable for the Linux SoC to get a high-priority interrupt rather than a forced reset: unless its filesystem is strictly read-only plus a RAM disk for session data, failing to do a clean shutdown risks filesystem corruption.  You also get the opportunity to log the error and critical system state data to aid post-mortem analysis of the fault.  Such a slave-processor-failure interrupt should also be an input into the master's watchdog, so that if the master fails to respond in a timely fashion, it gets reset as well.
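
To make that last point concrete, here is a minimal model of the gating logic, written in Python purely for illustration. In a real design this would more likely sit in the supervising MCU or in external watchdog hardware. The function name, its arguments and the deadline value are assumptions made for the sketch.

```python
def allow_master_kick(fault_raised_at, fault_acked_at, now, ack_deadline_s):
    """Decide whether the master's watchdog kicks should still be honoured.

    All times are in seconds on one shared clock; None means "has not happened".
    """
    if fault_raised_at is None:
        return True  # no slave fault pending, normal operation
    if fault_acked_at is not None and fault_acked_at >= fault_raised_at:
        return True  # master has responded to this fault (logged it, began shutdown)
    # Fault is still unacknowledged: tolerate it only until the deadline passes.
    return (now - fault_raised_at) <= ack_deadline_s
```

Once this returns False, the master's kicks are ignored and its watchdog is allowed to expire, so a master that is alive but not reacting to slave faults still gets reset.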

For the slave MCUs, it depends on whether whatever they control could be at risk if they are reset.  If it's possible to mitigate the risk by doing a controlled shutdown, then a similar high-priority, interrupt-triggered strategy would be appropriate.

Whether or not any of the watchdogs should ever trigger a master reset also depends on whether the system as a whole can usefully continue with reduced functionality with one processor down.   You may also need to consider how to hold a failed processor in reset to lock it out if its watchdog is repeatedly triggering.
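
The lock-out idea could be sketched as a sliding-window counter like the one below. Again this is Python for illustration only: the class name and the thresholds are hypothetical, and on real hardware it would be a few lines in the supervising MCU driving the failed processor's reset line.

```python
from collections import deque


class ResetLockout:
    """Track watchdog resets of one processor and flag when to hold it in reset."""

    def __init__(self, max_resets, window_s):
        self.max_resets = max_resets  # resets tolerated inside one window
        self.window_s = window_s      # length of the sliding window, in seconds
        self._resets = deque()        # timestamps of recent resets, oldest first

    def record_reset(self, now):
        """Record a reset at time `now`; return True if the processor should be locked out."""
        self._resets.append(now)
        # Forget resets that have fallen out of the window.
        while self._resets and now - self._resets[0] > self.window_s:
            self._resets.popleft()
        return len(self._resets) > self.max_resets
```

With ResetLockout(3, 60), for example, a fourth watchdog reset within 60 seconds would return True, and the supervisor would then keep that processor's reset asserted and carry on (or shut down) without it.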

 

Offline rakeshm55 (Topic starter)

  • Regular Contributor
  • *
  • Posts: 207
Re: watchdog handling in Multiprocessor environment
« Reply #4 on: March 28, 2018, 09:35:14 am »

There are three separate uCs in the system (actually four, counting one for the HMI).
The OS runs on one of them; the rest run non-OS bare-metal code.
One of the non-OS uCs controls the HMI; let's call this the master uC. It is the master during power-up and helps wake up the other two uCs (one non-OS and the other running Linux).

A watchdog reset of this master uC will be treated as a whole-system wake-up (it resets both the Linux uC and the non-OS uC).




 

