It sounds like at least one of your userspace processes consumes increasing amounts of RAM, until the kernel Out-Of-Memory killer is triggered. This is expected and as-designed behaviour.
(Although the Pis are not the best SBCs out there, the only hardware issue I know of is of the "in certain, very specific circumstances, may lose data" sort, not of the crashing sort. Running the same software on any similar SBC with the same amount of RAM, you'd see the exact same issue. If it were a power supply problem or similar, the Pi would usually become completely unresponsive. The fact that only some processes get killed is a telltale sign of the OOM killer. The kernel logs, either via dmesg or the files in /var/log/, will tell you exactly what happened.)
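For example, to confirm it really was the OOM killer, you can search the kernel log for its messages. The exact wording varies between kernel versions and distributions, so treat these grep patterns as a rough sketch rather than the definitive strings:

```
# Look for OOM killer activity in the kernel ring buffer
dmesg | grep -iE 'out of memory|oom'

# Or in the persistent logs (file names vary by distribution)
grep -iE 'out of memory|oom' /var/log/syslog /var/log/kern.log 2>/dev/null
```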
What you can do is log e.g. ps axfuww | sort -grk 5 to a file once every hour or so (using a script in /etc/cron.hourly/), and find out which process it is; that command lists the biggest memory hogs first. (The logs will tell you which processes the OOM killer killed, but only logging all processes will tell you exactly why it happened. A single snapshot is not useful either; you want to see which process actually grew continuously.)
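As a rough sketch, such an hourly logging script could look like the following; the script name, log path, and the 20-line cutoff are arbitrary choices here:

```
#!/bin/sh
# /etc/cron.hourly/log-memory-hogs (hypothetical name)
# Append a timestamped snapshot of the biggest memory users to a log file.
LOG=/var/log/memory-hogs.log
{
    date '+=== %Y-%m-%d %H:%M:%S ==='
    ps axfuww | sort -grk 5 | head -n 20
} >> "$LOG"
```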
When you find the process that keeps growing, and it runs as a service, you can use e.g. a daily script to restart that service until you fix the memory leak.
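If the leaky process is managed as a systemd service, for instance, a minimal daily restart script might look like this (the script and unit names are just placeholders):

```
#!/bin/sh
# /etc/cron.daily/restart-leaky-service (hypothetical name)
# Assumes a systemd-managed service; replace 'leaky.service' with the real unit.
systemctl restart leaky.service
```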
Otherwise, you can use a daily script that runs e.g. kill -HUP $(pidof processname) or killall -HUP processname to send the memory-hogging process a hangup signal (-TERM and -KILL are alternatives; you can even send HUP as a gentle reminder, then TERM as a hard request, and finally KILL as the no-questions-asked signal, with a couple of seconds of grace time in between), and then restart the process.
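A sketch of that escalation, where the process name, grace periods, and start command are placeholders you would adapt:

```
#!/bin/sh
# Escalating stop for a hypothetical 'leakyproc', then restart it.
killall -HUP  leakyproc 2>/dev/null && sleep 5
killall -TERM leakyproc 2>/dev/null && sleep 5
killall -KILL leakyproc 2>/dev/null
/usr/local/bin/leakyproc &   # hypothetical path; start it however you normally do
```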
This all depends on first determining which process or processes do leak memory, though.
For future reference:
You can set process limits using e.g. prlimit for each individual process (and its children) when you start it. Instead of /path/to/binary arguments... have your script run e.g. /usr/bin/prlimit --as=64M:64M /path/to/binary arguments... This means that the process cannot exceed the given limit. The --as option limits the size of the address space in use by the process. If the process tries to exceed that, syscalls like sbrk() and mmap() will fail; in practice, memory allocation will fail.
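Putting that into a wrapper script, following the same --as=soft:hard form as above (the limit value and binary path are placeholders; if your prlimit version does not accept the M suffix, give the limit in plain bytes instead):

```
#!/bin/sh
# Hypothetical wrapper: start the binary with a capped address space.
# Replace the limit and path with your own values.
exec /usr/bin/prlimit --as=64M:64M /path/to/binary "$@"
```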
Similarly, for processes that are less important than others, you can use nice -n level (CPU scheduling priority) and/or ionice -n level (I/O priority) to reduce their priorities. For nice, the niceness goes from -20 (most favoured) through 0 (default) up to 19 (least important); unprivileged users can only make a process nicer, not less nice. For ionice in the default best-effort class, the level goes from 0 (highest) to 7 (least important). Just chain them like prlimit above, as shown below.
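For example, all chained on one command line (the specific levels and binary path are only illustrative):

```
# Low CPU priority, low I/O priority, and a capped address space,
# all applied to the same (hypothetical) binary:
nice -n 19 ionice -n 7 /usr/bin/prlimit --as=64M:64M /path/to/binary arguments...
```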
You can even "manually" adjust the OOM score of a process (identified by its process ID, and controlled via /proc/PID/oom_score_adj or the legacy /proc/PID/oom_adj, with the resulting score visible in /proc/PID/oom_score), so that when an OOM situation occurs, the OOM killer preferentially targets or avoids that particular process depending on the adjustment. See man 5 proc for the descriptions of these pseudofiles.
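As a sketch, assuming a hypothetical daemon name and a single running instance of it, a positive adjustment makes a process a preferred victim and a negative one shields it (-1000 disables killing it entirely); writing negative values requires root:

```
# Make this process the OOM killer's preferred victim:
echo 1000 > /proc/$(pidof mydaemon)/oom_score_adj

# Or protect it almost completely (needs root):
echo -1000 > /proc/$(pidof mydaemon)/oom_score_adj
```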
Many people complain about the Linux OOM killer without realizing that it simply applies a policy that we humans can trivially tune from userspace. The default policy is a compromise between all sorts of use cases, intended to "do the least harm in any situation", with a heuristic for choosing which processes it kills. This has been a computer science research topic for at least a couple of decades, and nobody has found a better automatic heuristic for choosing what to kill; it sounds like a simple problem, but when you realize that different types of services and applications naturally use different amounts of resources, and that their importance varies depending on what the device is used for, it becomes intractable. The only sane solution is to use minimally invasive defaults, and let human administrators tune the settings as needed, if needed. Usually, it's not needed.
Process limits are a separate mechanism. You can even limit how much CPU time a process may accumulate. Memory limits affect how much memory the process can request from the kernel. These all come in pairs: a soft limit and a hard limit. For example, when a process exceeds its soft CPU time limit, it is sent a SIGXCPU signal (and again roughly once per second thereafter); if it then exceeds the hard CPU time limit, it is killed with a SIGKILL signal, which cannot be caught, blocked, or ignored.
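With prlimit, CPU time limits use the same soft:hard form as the address-space example above; the values and binary path here are arbitrary:

```
# Allow 60 s of CPU time before SIGXCPU, 90 s before SIGKILL:
/usr/bin/prlimit --cpu=60:90 /path/to/binary arguments...
```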
Since different workloads are usually groups of processes and not single processes, you can also use control groups, AKA cgroups. For typical small appliance use cases you don't need them; cgroups become most useful when you have multiple concurrent workloads that need different limits so they don't unduly affect each other.
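As a rough sketch, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the memory controller enabled (the group name, 256 MiB cap, and PID are placeholders):

```
# Create a group, cap its memory, and move a process into it:
mkdir /sys/fs/cgroup/leakygroup
echo 268435456 > /sys/fs/cgroup/leakygroup/memory.max
echo "$PID" > /sys/fs/cgroup/leakygroup/cgroup.procs
```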
Some Linux SBCs, but not Pis as far as I know, use the ARM big.LITTLE architecture, where the same System-on-Chip has both "big"/fast CPU cores and "small"/slow CPU cores. For these, the taskset command can be used just like the resource-limiting commands above, to specify which CPUs the process (and its child processes) can use. Combined with nice and ionice, this allows restricting a process or a set of processes to run only on the "small" (or "big"!) cores.
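For example, assuming a hypothetical big.LITTLE SoC where CPUs 0-3 are the "small" cores, you could pin a low-priority process like this:

```
# Run on the "small" cores only, with low CPU and I/O priority
# (core numbers and binary path are placeholders):
taskset -c 0-3 nice -n 19 ionice -n 7 /path/to/binary arguments...
```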