Author Topic: [solved] Linux/Arm, catastrophic crashes  (Read 4020 times)


Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
[solved] Linux/Arm, catastrophic crashes
« on: October 01, 2021, 10:41:31 pm »
I have tested several kernels, from 5.4.128 to 5.15.0, and the result is always the same:
- within 48h of burn-in testing
- the kernel crashes

Code: [Select]
neo / # XFS (sda3): Metadata corruption detected at 0xc037735c, xfs_inode block 0x953140 xfs_inode_buf_verify
XFS (sda3): Unmount and run xfs_repair
XFS (sda3): First 128 bytes of corrupted metadata buffer:
00000000: 6d 20 25 54 20 25 73 20 25 54 0a 00 46 6f 75 6e  m %T %s %T..Foun
00000010: 64 20 6e 65 77 20 72 61 6e 67 65 20 66 6f 72 20  d new range for
00000020: 00 00 00 00 41 73 73 65 72 74 69 6f 6e 73 20 74  ....Assertions t
00000030: 6f 20 62 65 20 69 6e 73 65 72 74 65 64 20 66 6f  o be inserted fo
00000040: 72 20 00 00 0a 09 42 42 20 23 25 64 00 00 00 00  r ....BB #%d....
00000050: 0a 09 45 44 47 45 20 25 64 2d 3e 25 64 00 00 00  ..EDGE %d->%d...
00000060: 0a 09 50 52 45 44 49 43 41 54 45 3a 20 00 00 00  ..PREDICATE: ...
00000070: 0a 41 53 53 45 52 54 5f 45 58 50 52 73 20 74 6f  .ASSERT_EXPRs to
XFS (sda3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x953140 len 32 error 117
XFS (sda3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -117.
XFS (sda3): xfs_do_force_shutdown(0x8) called from line 3749 of file fs/xfs/xfs_inode.c. Return address = 7769d00a
XFS (sda3): Corruption of in-memory data detected.  Shutting down filesystem
XFS (sda3): Please unmount the filesystem and rectify the problem(s)
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007

I have exhaustively tested:
- the RAM: all tests passed
- the hard drive (four different models tested): all tests passed with badblocks and SMART
- the operating temperature: a steady 22C thanks to a big heat-sink + fan cooling

So I'm running out of ideas :-//

The burn-in test consists of:
- 8GB of swap activity, churning virtual memory back and forth over 512MB of physical RAM
- 90% CPU activity, but with only 2 of 4 cores used
- no FPU usage


The system appears to work "ok" for 12 hours, but within 48 hours it crashes. The problem now is deciding what to investigate, and how.
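For reference, a load of roughly this shape could be generated with stress-ng; the invocation below is an illustrative sketch (sizes and flags are assumptions), not my exact test harness:

```shell
# Illustrative burn-in sketch (assumes stress-ng is installed):
# 2 workers at ~90% CPU load, plus VM workers sized well past the
# 512MB of physical RAM so the kernel is forced to swap constantly.
tail -n +2 /proc/swaps | grep -q . || { echo "no swap active"; exit 1; }
stress-ng --cpu 2 --cpu-load 90 \
          --vm 2 --vm-bytes 2g --vm-keep \
          --timeout 48h --metrics-brief
```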

(edit: I seriously don't recommend Allwinner's stuff)
« Last Edit: October 25, 2021, 11:37:23 am by DiTBho »
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #1 on: October 01, 2021, 10:45:38 pm »
it happens (tested) with
- ext4
- ext2
- xfs-v4
- btrfs

So I can rule out a filesystem bug; rather, it smells like something bad with swap, or with the EHCI bulk module. The storage is a USB-to-SATA hard drive connected to the first EHCI controller.
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, crashes
« Reply #2 on: October 01, 2021, 10:48:28 pm »
It is catastrophic because it severely damages the file system to the point that the partition must be wiped, formatted and restored from backup.
 

Offline Marco

  • Super Contributor
  • ***
  • Posts: 5367
  • Country: nl
Re: Linux/Arm, catastrophic crashes
« Reply #3 on: October 01, 2021, 11:35:13 pm »
Could the power be dipping?
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #4 on: October 02, 2021, 08:43:22 am »
Could the power be dipping?

The PSU I am using is a professional laboratory power supply manufactured by Rigol.
The SoM module eats << 400mA @ 5V, the max peak is 600mA, the current is limited to 1A.

The hard drive is powered by the second channel of my PSU. It draws 1.1A @ 5V at max peak (never measured myself; it's Seagate's figure, and I see less current drawn anyway). The current is limited to 2A.


What perplexes me: I have enabled all the kernel "debugging" and "verbose" support (even the memory-leak checkers and the SCSI verbose logging) and can successfully run exhaustive read/write tests that continuously move 50Mbyte/sec for 96 hours without a single hint of a crash.

Likewise, I can mount a 400Mbyte ramdisk and run the same exhaustive read/write tests, continuously moving 300Mbyte/sec for 96 hours, without a single hint of a crash.




 

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 688
  • Country: de
Re: Linux/Arm, catastrophic crashes
« Reply #5 on: October 02, 2021, 08:50:43 am »
If it's a hard disk, a real spinning disk, I would expect higher peak currents when the heads move around. They're probably very short. Attach a scope to the 5V HDD supply in single-shot mode with the trigger level around 4.5V (I would expect 5V -10% is still accepted) and wait to see if voltage dips happen.
 
The following users thanked this post: DiTBho

Offline evb149

  • Super Contributor
  • ***
  • Posts: 1999
  • Country: us
Re: Linux/Arm, catastrophic crashes
« Reply #6 on: October 02, 2021, 09:25:02 am »
Just some miscellaneous quick thoughts:

1: Depending on your system HW and SW you might be able to run something like QEMU to emulate parts of the configuration and programs etc. Then you could see whether the similar system you construct and run in a VM shows signs of instability over days' time or not. If not, maybe that points to something more relevant to the HW / drivers than to the higher-level SW functions.

2: Certainly you can wish for a stable system and seek high availability / reliability / MTBF. However, in typical situations you must always assume that your system CAN and WILL at some point crash, whether due to a power failure, SW bug, HW problem, EMI, environmental issues (overheating, too cold, ...), cosmic rays, whatever. If you accept that the system should be crash-tolerant, and it is undesired and possibly unacceptable for the disc partitions to be corrupted and lose data in a crash, then you may wish to reconsider your data-integrity model for storage / partitions / file systems. RAID-based storage helps in some cases; so does redundant storage HW. Many file systems can be configured with journaling etc. to make recovery and data-loss risks more predictable and controllable. You can also control how data is written to disc: e.g. the OS/application always forces writes to completely finish before returning from the "write()/..." call, or data may be write-back cached but the dirty pages / blocks forced to be "sync"ed to disk every N seconds to limit the loss of buffered, unwritten data. There are also copy-on-write file systems, which ensure you always have a correct copy of every file: you may lose updates that had not completely finished, but you still have correct "old" copies, never corrupt ones.

3: Tune the swapping and caching etc. to see whether something is using too much memory over time and you are hitting resource exhaustion.
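For example, the relevant knobs can be inspected (and, as root, tuned) via /proc and sysctl; the values to set depend entirely on the workload:

```shell
# Inspect the VM knobs that govern swap and write-back behaviour
# (read-only; the tuning lines are commented out and need root):
cat /proc/sys/vm/swappiness               # how eagerly the kernel swaps
cat /proc/sys/vm/dirty_ratio              # % of RAM dirty before writers block
cat /proc/sys/vm/dirty_background_ratio   # % of RAM dirty before background flush
# sysctl -w vm.swappiness=10              # e.g. reduce swap pressure for a test run
# watch -n 60 'head -n 5 /proc/meminfo'   # look for slow memory growth / leaks
```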

4: Check what you may have enabled in the way of cron jobs or periodically scheduled system maintenance functions: log-file rotation, auto-backup, file-content / directory indexing, online disk checks, automated software-update checks, anti-malware scans, mail processing, whatever is actually enabled. It is common to have things that run once a day, once every two days, a few days a week, once a week, once a month, etc., which can mean the system usually runs fine and yet, unexpectedly, after some time becomes busy, slow, gobbles lots of memory / disk IO, crashes, whatever.

5: Speaking of log files: make sure they are constrained in size and rotation logic, or that's another thing that can itself cause the system to fail after some extended time, as the logs fill up the temporary space, log partition, whatever.

6: Do surface scans / extended analysis of your media.  It is not uncommon to have scattered defective blocks / regions / sectors on disc drives that may fail writing or reading even though most areas are OK for now.  If these are not scanned for they'll randomly be hit and will crash the system absent some kind of RAID or other intelligence to ameliorate unrecoverable I/O read/write errors.
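For example (the device path is a placeholder; the write-mode scan destroys all data):

```shell
# Read-only surface scan; safe on an unmounted disk:
badblocks -sv /dev/sdX
# Write-mode scan (DESTRUCTIVE: wipes the entire disk):
# badblocks -wsv /dev/sdX
# Kick off the drive's own extended self-test; results readable later:
smartctl -t long /dev/sdX
smartctl -l selftest /dev/sdX
```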

7: Also pattern-test the RAM. Many times you can have a system that is 100% stable under a CPU / memory stress benchmark for many hours, yet crashes once a day or two, or even less often, because of some low-probability or isolated-location error that rarely gets triggered, but can be hit repeatably as caches grow or something exercises normally unused RAM areas.

8: Watch out for power management. Sometimes peripherals like PCI devices, USB devices, networking HW, the CPU, disc drives, whatever, eventually go into some power-saving mode, and maybe the driver / kernel / BIOS / motherboard / hardware is not stable in that case; the system then inevitably crashes when they go to sleep or try to wake up.


 
The following users thanked this post: DiTBho

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #7 on: October 02, 2021, 11:05:06 am »
QEMU to emulate parts of the configuration and programs etc

That's a good idea.
 

Offline PKTKS

  • Super Contributor
  • ***
  • Posts: 1423
  • Country: br
Re: Linux/Arm, catastrophic crashes
« Reply #8 on: October 02, 2021, 11:26:46 am »
There have been issues with swap in the 5.x kernel branch.

Some were claimed fixed, although depending on the hardware itself I would not consider it production-ready.

To rule out the branch entirely you can (if your hardware allows) try the ultra-stable, production-quality 4.19.x branch or one close to it.

If your hardware does not require anything that new, your chances will pretty much improve.

Paul
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #9 on: October 02, 2021, 11:32:41 am »
real spinning disk

Yes, it's a 2.5" ElectroMechanical Hard Disk Drive spinning at 5400 rpm.

When I exhaustively tested the USB bulk transfers a month ago, I used progressive LBA addresses to write and read back blocks (this is also what the program "badblocks" does). But if the problem is related to high current peaks when the heads move around, that means I need to run tests with random LBA addressing, which is what the kernel does with virtual memory; progressive LBA addresses correspond to slow, progressive head movements, and therefore to smaller current peaks.
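For reference, a random-LBA torture run could be sketched with fio, assuming it is available for this target; /dev/sdX is a placeholder and the run is destructive:

```shell
# Random 4k reads/writes across the whole device to force worst-case
# head movement. DESTRUCTIVE on /dev/sdX: use a scratch disk.
fio --name=seek-torture --filename=/dev/sdX \
    --rw=randrw --bs=4k --direct=1 --iodepth=4 \
    --runtime=48h --time_based
```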

Mumble

but it somehow makes sense; I mean, the virtual memory activity I observed during the kernel stress-test moved the heads like a deranged dog on speed (I heard the sound the heads made, kind of "tic, tic, tic, tac, tac, tic"), and at some point this probably demands a current peak that the on-board LDO is unable to provide.

Umm, so a problem with the on-board power module, probably a tank capacitor or something similar  :-//

I will investigate and also try with a Solid state disk (flash-SSD or sdram-disk). But first I want to re-test everything on a virtual machine to make sure it's not a kernel bug.

(oh, here I assume the rootfs cannot crash the kernel, but it's a strong conjecture)

Thanks guys! Great advice!  :D
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #10 on: October 02, 2021, 11:40:54 am »
There have been issues with swap in the 5.x kernel branch.
Some were claimed fixed, although depending on the hardware itself I would not consider it production-ready.

That's worrying, but it would also explain why the individual tests pass while the system crashes during swap activity.

So it also makes sense to re-try everything with k4.19.*
Thanks!

edit:
I also tested k5.14.1; no progress for me.
« Last Edit: October 02, 2021, 11:48:08 am by DiTBho »
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 2749
  • Country: us
Re: Linux/Arm, catastrophic crashes
« Reply #11 on: October 02, 2021, 06:01:26 pm »
There have been issues with swap in the 5.x kernel branch.

Do you have a citation for that?  Is that platform specific?
 

Online ejeffrey

  • Super Contributor
  • ***
  • Posts: 2749
  • Country: us
Re: Linux/Arm, catastrophic crashes
« Reply #12 on: October 02, 2021, 06:24:16 pm »
You say you have exhaustively checked the ram.  How?  For how long?  Have you tried walking back the memory timings?

How many of these units do you have?  Does it occur on multiple systems or just one?

What is the die temperature? Surely not 22 C. Maybe the chip has crappy thermal packaging and isn't cooling or thermal-throttling properly? Try backing down the clock speed?

 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 3272
  • Country: fi
    • My home page and email address
Re: Linux/Arm, catastrophic crashes
« Reply #13 on: October 02, 2021, 07:07:41 pm »
The SoM module eats << 400mA @ 5V, the max peak is 600mA, the current is limited to 1A.

The hard-drive is powered by the second channel of my PSU, it eats 1.1A @ 5V as max peak (never measured, but it's what Seagate said, I see less current absorbed, anyway) the current is limited to 2A.
In my experience, SBC current use is very spiky, and they don't typically have much bulk capacitance on the power rail.

(This is basically the reason I started developing this Teensy 3.2 carrier for Odroid HC1.  It takes a strictly regulated 5V in, but when using a 3.5" spinny-rust HDD, can consume up to 4-6A.  The idea was to carefully measure whether there were any sub-microsecond dips in the voltage due to current use spiking.  Such minimally short dips sound like they shouldn't affect anything, but remember, these SBCs really don't have any bulk capacitance on their power inputs to smooth them out, not at these current levels.  And on a lab power supply there usually isn't much bulk capacitance on the output either, so you're more likely to see this on a lab power supply than on typical 5.15V wall warts, which tend to have at least some bulk capacitance as an output filter.)

(The scenario I'm thinking is that various actions cause small current use spikes, and it is only a rare occurrence when the SBC operations combine just right to cause a deep enough dip: a very short current draw spike, that drops the power rail just enough to cause a kernel oops, perhaps due to a memory glitch, or a support chip glitch.  I don't think the SoC chip is too sensitive to these, especially if they use switchers (SMPS) for their various power rails.  The symptoms do fit, but this is less likely than a pure software (kernel) bug; it is just one thing to rule out as a possible cause.)

I suggest ruling this out, by putting some bulk capacitance between your SBC input rails, say a couple of milliFarads of low-ESR capacitance (electrolytics) between the input power rail and ground (like bypass caps on chips), and re-test for a few days to see if this affects the stability of your SBC.  It takes time, but is not difficult to do; and you can spend the test time poring over the kernel sources and LKML to see if anything similar (to the commonalities you've seen in the OOPSes when using different filesystems) has been reported.  In other words, drudge work ensues...
« Last Edit: October 02, 2021, 07:10:29 pm by Nominal Animal »
 
The following users thanked this post: DiTBho

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #14 on: October 02, 2021, 08:05:04 pm »
You say you have exhaustively checked the ram.

Yup

How? 

Different methods.
  • u-boot memtest
  • Linux "memtest/ramspeed" (it's a dedicated program)
Code: [Select]
  Stuck Address       : ok
  Random Value        : ok
  Compare XOR         : ok
  Compare SUB         : ok
  Compare MUL         : ok
  Compare DIV         : ok
  Compare OR          : ok
  Compare AND         : ok
  Sequential Increment: ok
  Solid Bits          : ok
  Block Sequential    : ok
  Checkerboard        : ok
  Bit Spread          : ok
  Bit Flip            : ok
  Walking Ones        : ok
  Walking Zeroes      : ok
  8-bit Writes        : ok
  16-bit Writes       : ok
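The pass list above is the format printed by memtester (or a similar tool); for reference, a comparable run on a 512MB board, leaving headroom for the kernel, would look something like:

```shell
# Test 400MB of RAM for 4 full passes (needs root to mlock the region):
memtester 400M 4
```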

For how long?

96h

Have you tried walking back the memory timings?

I patched u-Boot with the lowest DRAM_CLK, and retested everything for 96h.
All tests passed, and thanks to the lower clock the RAM chip runs at a better temperature.

How many of these units do you have?  Does it occur on multiple systems or just one?

Four units with the same features (RAM, disk, etc.), tested in parallel. The same result.
Each burn-in test is scheduled for two days; I launch the tests in parallel in a corner of the lab, wait for two days, then check the results, which include uptime and debugging information (sadly scarce, and with no I/O profiling information yet).

What is the die temperature? Surely not 22 C

The thermal sensor inside the SoC reports 37 C (1), the thermal sensor near the heat-sink reports 22 C.
The air temperature of the room is 19 C. I don't have an IR camera. The critical SoC temperature should be above 50 C.

(1) highest reported temperature, with two cores at 90%, two core at 0-5%

Code: [Select]
source="/sys/class/thermal/thermal_zone0/temp"
if [ -f "$source" ]
   then
       # the sysfs value is in millidegrees Celsius
       echo $(( $(cat "$source") / 1000 ))
   else
       echo "$source not found"
   fi

I disabled the speed stepping; the SoC always and only operates at the lowest CPU clock possible, which is 600MHz instead of 1GHz, and the GPU is disabled.
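For reference, pinning all cores through the standard cpufreq sysfs interface looks something like the sketch below; the governor name and frequency table depend on the SoC and kernel config:

```shell
# Force the powersave governor (lowest OPP) on all four cores and
# report the resulting clock; paths are the standard cpufreq layout.
for c in /sys/devices/system/cpu/cpu[0-3]/cpufreq; do
    echo powersave > "$c/scaling_governor"
    cat "$c/scaling_cur_freq"      # reported in kHz
done
```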
« Last Edit: October 02, 2021, 09:18:57 pm by DiTBho »
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #15 on: October 02, 2021, 08:31:26 pm »
3.5" spinny-rust HDD, can consume up to 4-6A

Considering that CF Microdrives are gentler, 2.5" HDDs spinning at 5400 rpm shouldn't show the same peak currents you see in 3.5" HDDs spinning at 7200 rpm.

My conjecture

But, for instance, my SCSI-SCA U320 Fujitsu Raptor spinning at 12000 rpm is literally a monster: it makes a lot of noise when it starts, sounding like a jet turbine, and it's known to draw high peak currents (>> 5A) when the 4-phase BLDC motor starts to spin up and accelerate; in fact, there is a "soft start" mode to manage it.

I don't know, like you said, this stuff needs more measurements :-//

(thanks for advice, it's super useful)


edit:
see here, high-speed drives like the Seagate Barracuda take 8W on average, with high current at start-up / spin-up. But a 2.5" mechanical drive from the same brand consumes a bit less, because the discs inside are smaller and lighter, so the motor doesn't need as much oomph, torque, to keep them spinning, and the spin speed itself is 5400 vs 7200 rpm.
« Last Edit: October 02, 2021, 09:06:07 pm by DiTBho »
 

Offline Twoflower

  • Frequent Contributor
  • **
  • Posts: 688
  • Country: de
Re: Linux/Arm, catastrophic crashes
« Reply #16 on: October 02, 2021, 09:47:57 pm »
I would be more worried about the head actuator. At the acceleration rates of the head stack in modern drives, they're reasonably power hungry, and thus will produce very spiky currents, especially during frequent back-and-forth movements: copying on the same drive, accessing lots of small areas spread over the whole platter(s).

Edit: This paper shows the power consumption during seeks (An Analysis of Hard Drive Energy Consumption, Figure 11). But the 2.5" drive you're using won't be as bad, as the figure shows an older 3.5" drive.
« Last Edit: October 02, 2021, 10:07:34 pm by Twoflower »
 
The following users thanked this post: DiTBho

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #17 on: October 02, 2021, 10:11:11 pm »
I would be more worried about the head actuator. At the acceleration rates of the head stack in modern drives, they're reasonably power hungry.

I am using 160GB and 320GB 2.5" SATA HDDs (~2010). Are they modern enough? Or does that apply only to current 1TB and 2TB HDDs? :-//

And thus will produce very spiky currents. Especially during frequent back and forth movements. Like copy on same drive, access lots of small areas spread over the whole platter(s).

Damn, this is exactly what the kernel does with virtual memory: a lot of fast copies at random LBA addresses.


I am wondering whether it would be better to force deferred I/O in the kernel to slow down the read/write operations on the swap and make them use more progressive LBA addresses; this means adding requests to a queue, sorting it, and processing all the virtual-memory requests later.


I am also tempted to move the swap to the network, even if the NIC can only negotiate a 100Mbit/s link, which means 8Mbyte/sec, ~12 times slower than swap on a HDD partition.
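For reference, network swap could be set up over nbd roughly as sketched below; hostname, port, and paths are placeholders, it uses the older positional nbd-client syntax, and swapping over the network has well-known deadlock caveats:

```shell
# On the server side (exports a pre-created file as a block device):
#   nbd-server 10809 /srv/swapfile.img
# On the SBC (needs root):
modprobe nbd
nbd-client server.lan 10809 /dev/nbd0
mkswap /dev/nbd0
swapon -p 10 /dev/nbd0     # higher priority than the local HDD swap
```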

At the moment I am preparing the testing environment for k4.14.199, and I am compiling qemu-arm to test the kernel itself in a virtual machine.

I will run this stuff on Monday.
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #18 on: October 05, 2021, 10:32:23 pm »
Two machines with k5.15.1 + patches + hacks have been running for 29 hours without a crash  :o :o :o

(fingers crossed)


p.s.
In "kernel next", I see disturbing patches for ext4 and kvm
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #19 on: October 05, 2021, 10:35:45 pm »
Two machines have been running k4.14.199 for 25 hours without a crash.
And I have k4.14.199 running under Qemu/arm.

We will see ...
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 3272
  • Country: fi
    • My home page and email address
Re: Linux/Arm, catastrophic crashes
« Reply #20 on: October 05, 2021, 10:43:27 pm »
Have you added a few milliFarads of bulk capacitance to the power input?  I'd be very interested to know if it helps or not.
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #21 on: October 06, 2021, 09:09:21 am »
Have you added a few milliFarads of bulk capacitance to the power input?  I'd be very interested to know if it helps or not.

Not yet  :o

This is the same hardware in the same configuration, just with different kernels.
Yesterday I ordered qty=2 new boards and I will test them with bulk capacitance.

Anyway, the SATA hard drive has never been powered by the SBC; it has a dedicated power supply line, since I am using my professional laboratory PSUs to supply each board and each hard drive independently.

(I asked my boss if I could take five PSUs home from work, and she agreed to lend them to me. Nice girl! But in the future I will have to buy a second Rigol PSU, for sure.)

Code: [Select]
Rigol PSU
 ___________
|           |
|    PSU    |
|           |
| ch0 | ch1 |
|     |     |
| 12V | 05V |
| + - | - + |
|_|_|_|_|_|_|
  | |   | |
  | +---+ |
  |  GND  |
Vcc12   Vcc05

 ___________
|        |  |
| HDD2.5"|  |==== usb2
|  sATA  |  |=============== { Vcc05 , GND }
|________|__|

 ___________
|           |
|           |======== usb2 (to sATA disk)
|           |======== eth0  (to testing machine)
|    SBC    |======== uart0 (to testing machine)
|    ARM    |
|           |
|           |
|           |=============== { Vcc12 , GND }
|___________|


This means, the capacitor will help only if the problem is related to the SoC.

With the k5.15.1 the temperature looks slightly higher

Code: [Select]
k5.15.1 # mytemperature
34

Code: [Select]
k4.14.199 # mytemperature
29

I removed the cooling fan, since it was powered by the SBC and I want to make sure there is no interference from any BLDC motor; the SoC now runs with only the aluminum heat-sink.

Code: [Select]
# mycpufreq
648Mhz@cpu0
648Mhz@cpu1
648Mhz@cpu2
648Mhz@cpu3
Other valid values for the PLL are 816Mhz and 1008Mhz

All profiles are "power-save" and speed-stepping is disabled, so this is a fixed frequency that never changes, and it's the same core clock used under k4.14.199. But k5.15.1 probably runs different algorithms that move more data, so more transistors are switched per unit of time, more current is consumed, and the temperature is slightly higher.

It makes sense  :o :o :o
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #22 on: October 06, 2021, 09:17:26 am »
The latest stable k4.9 is k4.9.285.
It has just been released.
The updated 4.9.y git tree can be found at:
Code: [Select]
        git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git linux-4.9.y

I will!
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 3272
  • Country: fi
    • My home page and email address
Re: Linux/Arm, catastrophic crashes
« Reply #23 on: October 06, 2021, 09:57:21 pm »
This means, the capacitor will help only if the problem is related to the SoC.
Right.

I guess using an oscilloscope in trigger mode (on the SBC power rail) could also be interesting.  It doesn't have to have much bandwidth, since any glitches under say 1µs (corresponding to 1 MHz) should be filtered out. My own worries are glitches in the 1µs to 100ms range: long enough to not be fully filtered out due to lack of bulk capacitors on an SBC, but short enough to not cause a reliably triggered crash.

A microcontroller like Teensy monitoring the voltage at say 100kHz rate (via a voltage divider and a buffer, like a TLV2371 RRIO opamp in my design), can do this continuously, and collect statistical information –– that design for Odroid HC1 is such that the Teensy can be connected to a remote computer, and log both serial console and any voltage glitches: basically be an UART-to-USB adapter with two TTY endpoints on the host.  The high-side current switch is there to act as a power-off switch, since the Odroid HC1 does not have one itself.

I guess I should make a Teensy LC or Teensy 4.0 version carrier of that (perhaps leaving the current measurement and power switch sections optional?), since the 74LVC1T45 transceivers mean the SBC side UART voltage levels can be anything from 1.8V to 5.5V, as long as you have a suitable reference voltage.  For this kind of testing (recording voltage glitches during a many-hour run), and interfacing to new/unknown devices like routers (due to the voltage shifters), it could be useful...  The key being that on the host, you have one TTY endpoint directly to the SBC UART, and one that reports the voltage (continuously, or only changes or glitches) on the measured power rail (and can act as a control channel to the Teensy itself).  (Using 16-bit binary, 100 kHz sample rate only takes about one fifth of the available practical bandwidth.  All Teensies easily reach about 1,000,000 bytes/sec in Linux via TTY aka USB Serial even when using a cheap, USD $10 USB isolator; I've tested this in practice.)
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 1133
  • Country: gb
Re: Linux/Arm, catastrophic crashes
« Reply #24 on: October 07, 2021, 04:10:55 am »
The "uptime" is now > 48h, and all boards are still fine  :o :o :o

Probably removing the cooling fan was a good idea, or it's completely unrelated and there is no deterministic cause; but during previous tests, not a single k5.1* kernel passed 24h of burn-in while the cooling fan was in use. Was its tiny BLDC motor the root of this weird problem? Who knows? :-//

The new patches also help. My goal is to get a working, trustworthy SBC; I am seriously late with this project. I have wasted two months on Allwinner's bugs, including bugs in their u-boot.

It would be funny if the problem was the BLDC motor of the cooling fan, which is attached to the SBC and sinks current from the same LDO that powers the SoC. In that case, though, you wouldn't see anything by observing the board's power-rail input, because the problem is internal to the board; you would have to put the current and voltage monitor just after the LDO.
« Last Edit: October 07, 2021, 05:29:31 am by DiTBho »
 
The following users thanked this post: Nominal Animal

