It sounds like at least one of your userspace processes consumes increasing amounts of RAM, until the kernel Out-Of-Memory killer is triggered. This is expected and as-designed behaviour.
(Although the Pis are not the best SBCs out there, the only hardware issue I know of is of the "in certain, very specific circumstances, may lose data" sort, not of the crashing sort. Running the same software on any similar SBC with the same amount of RAM, you'd see the exact same issue. If it were a power supply problem or similar, the Pi would usually become completely unresponsive. The fact that only some processes get killed is a telltale sign of the OOM killer. The kernel logs, either via dmesg or the files in /var/log/, will tell you exactly what happened.)
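For example, to confirm it really was the OOM killer, you can search the kernel log for its messages. The exact wording varies between kernel versions and distributions, so treat these grep patterns as a rough sketch rather than the definitive strings:

```
# Look for OOM killer activity in the kernel ring buffer
dmesg | grep -iE 'out of memory|oom'

# Or in the persistent logs (file names vary by distribution)
grep -iE 'out of memory|oom' /var/log/syslog /var/log/kern.log 2>/dev/null
```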
What you can do is log e.g. ps axfuww | sort -grk 5 to a file once every hour or so (using a script in /etc/cron.hourly/), and find out which process it is; that command lists the biggest memory hogs first. (The logs will tell you which processes the OOM killer killed, but only logging all processes will tell you exactly why it happened. A single snapshot is not useful either; you want to see which process actually grew continuously.)
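As a rough sketch, such an hourly logging script could look like the following; the script name, log path, and the 20-line cutoff are arbitrary choices here:

```
#!/bin/sh
# /etc/cron.hourly/log-memory-hogs (hypothetical name)
# Append a timestamped snapshot of the biggest memory users to a log file.
LOG=/var/log/memory-hogs.log
{
    date '+=== %Y-%m-%d %H:%M:%S ==='
    ps axfuww | sort -grk 5 | head -n 20
} >> "$LOG"
```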
When you find the process that keeps growing, and it runs as a service, you can use e.g. a daily script to restart that service until you fix the memory leak.
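If the leaky process is managed as a systemd service, for instance, a minimal daily restart script might look like this (the script and unit names are just placeholders):

```
#!/bin/sh
# /etc/cron.daily/restart-leaky-service (hypothetical name)
# Assumes a systemd-managed service; replace 'leaky.service' with the real unit.
systemctl restart leaky.service
```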
Otherwise, you can use a daily script that runs e.g. kill -HUP $(pidof processname) or killall -HUP processname to send the memory-hogging process a hangup signal (-TERM and -KILL are alternatives; you can even send HUP as a gentle reminder, then TERM as a hard request, and finally KILL as the no-questions-asked signal, with a couple of seconds of grace time in between), and then restart the process.
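A sketch of that escalation, where the process name, grace periods, and start command are placeholders you would adapt:

```
#!/bin/sh
# Escalating stop for a hypothetical 'leakyproc', then restart it.
killall -HUP  leakyproc 2>/dev/null && sleep 5
killall -TERM leakyproc 2>/dev/null && sleep 5
killall -KILL leakyproc 2>/dev/null
/usr/local/bin/leakyproc &   # hypothetical path; start it however you normally do
```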
This all depends on first determining which process or processes do leak memory, though.
For future reference:
You can set process limits using e.g. prlimit for each individual process (and its children) when you start it. Instead of /path/to/binary arguments... have your script run e.g. /usr/bin/prlimit --as=64M:64M /path/to/binary arguments... This means that the process cannot exceed the given limit. The --as option limits the size of the address space in use by the process. If the process tries to exceed that, syscalls like sbrk() and mmap() will fail; in practice, memory allocation will fail.
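Putting that into a wrapper script, following the same --as=soft:hard form as above (the limit value and binary path are placeholders; if your prlimit version does not accept the M suffix, give the limit in plain bytes instead):

```
#!/bin/sh
# Hypothetical wrapper: start the binary with a capped address space.
# Replace the limit and path with your own values.
exec /usr/bin/prlimit --as=64M:64M /path/to/binary "$@"
```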
Similarly, for processes that are less important than others, you can use nice -n level (CPU scheduling priority) and/or ionice -n level (I/O priority) to reduce their priorities. For nice, the niceness goes from -20 (most favoured) through 0 (default) up to 19 (least important); unprivileged users can only make a process nicer, not less nice. For ionice in the default best-effort class, the level goes from 0 (highest) to 7 (least important). Just chain them like prlimit above, as shown below.
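For example, all chained on one command line (the specific levels and binary path are only illustrative):

```
# Low CPU priority, low I/O priority, and a capped address space,
# all applied to the same (hypothetical) binary:
nice -n 19 ionice -n 7 /usr/bin/prlimit --as=64M:64M /path/to/binary arguments...
```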
You can even "manually" adjust the OOM score of a process (identified by its process ID, and controlled via /proc/PID/oom_score_adj or the legacy /proc/PID/oom_adj, with the resulting score visible in /proc/PID/oom_score), so that when an OOM situation occurs, the OOM killer preferentially targets or avoids that particular process depending on the adjustment. See man 5 proc for the descriptions of these pseudofiles.
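As a sketch, assuming a hypothetical daemon name and a single running instance of it, a positive adjustment makes a process a preferred victim and a negative one shields it (-1000 disables killing it entirely); writing negative values requires root:

```
# Make this process the OOM killer's preferred victim:
echo 1000 > /proc/$(pidof mydaemon)/oom_score_adj

# Or protect it almost completely (needs root):
echo -1000 > /proc/$(pidof mydaemon)/oom_score_adj
```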
Many people complain about the Linux OOM killer without realizing that it simply applies a policy that we humans can trivially tune from userspace. The default policy is a compromise between all sorts of use cases, intended to "do the least harm in any situation", with a heuristic for choosing which processes it kills. This has been a computer science research topic for at least a couple of decades, and nobody has found a better automatic heuristic for choosing what to kill; it sounds like a simple problem, but when you realize that different types of services and applications naturally use different amounts of resources, and that their importance varies depending on what the device is used for, it becomes intractable. The only sane solution is to use minimally invasive defaults, and let human administrators tune the settings as needed, if needed. Usually, it's not needed.
Process limits are a separate mechanism. You can even limit how much CPU time a process may accumulate. Memory limits affect how much memory the process can request from the kernel. These all come in pairs: a soft limit and a hard limit. For example, when a process exceeds its soft CPU time limit, it is sent a SIGXCPU signal (and again roughly once per second thereafter); if it then exceeds the hard CPU time limit, it is killed with a SIGKILL signal, which cannot be caught, blocked, or ignored.
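With prlimit, CPU time limits use the same soft:hard form as the address-space example above; the values and binary path here are arbitrary:

```
# Allow 60 s of CPU time before SIGXCPU, 90 s before SIGKILL:
/usr/bin/prlimit --cpu=60:90 /path/to/binary arguments...
```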
Since different workloads are usually groups of processes and not single processes, you can also use control groups, AKA cgroups. For typical small appliance use cases you don't need them; cgroups become most useful when you have multiple concurrent workloads that need different limits so they don't unduly affect each other.
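As a rough sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the memory controller enabled (the group name, 256 MiB cap, and PID are placeholders):

```
# Create a group, cap its memory, and move a process into it:
mkdir /sys/fs/cgroup/leakygroup
echo 268435456 > /sys/fs/cgroup/leakygroup/memory.max
echo "$PID" > /sys/fs/cgroup/leakygroup/cgroup.procs
```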
Some Linux SBCs, but not Pis as far as I know, use the ARM big.LITTLE architecture, where the same System-on-Chip has both "big"/fast CPU cores and "small"/slow CPU cores. For these, the taskset command can be used just like the resource-limiting commands above, to specify which CPUs the process (and its child processes) can use. Combined with nice and ionice, this allows restricting a process or a set of processes to run only on the "small" (or "big"!) cores.
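For example, assuming a hypothetical big.LITTLE SoC where CPUs 0-3 are the "small" cores, you could pin a low-priority process like this:

```
# Run on the "small" cores only, with low CPU and I/O priority
# (core numbers and binary path are placeholders):
taskset -c 0-3 nice -n 19 ionice -n 7 /path/to/binary arguments...
```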