
[solved] Linux/Arm, catastrophic crashes


DiTBho:
I have tested several kernels, from 5.4.128 to 5.15.0, and the result is always the same:
- within 48h of burn-in testing
- the kernel crashes


--- Code: ---
neo / # XFS (sda3): Metadata corruption detected at 0xc037735c, xfs_inode block 0x953140 xfs_inode_buf_verify
XFS (sda3): Unmount and run xfs_repair
XFS (sda3): First 128 bytes of corrupted metadata buffer:
00000000: 6d 20 25 54 20 25 73 20 25 54 0a 00 46 6f 75 6e  m %T %s %T..Foun
00000010: 64 20 6e 65 77 20 72 61 6e 67 65 20 66 6f 72 20  d new range for
00000020: 00 00 00 00 41 73 73 65 72 74 69 6f 6e 73 20 74  ....Assertions t
00000030: 6f 20 62 65 20 69 6e 73 65 72 74 65 64 20 66 6f  o be inserted fo
00000040: 72 20 00 00 0a 09 42 42 20 23 25 64 00 00 00 00  r ....BB #%d....
00000050: 0a 09 45 44 47 45 20 25 64 2d 3e 25 64 00 00 00  ..EDGE %d->%d...
00000060: 0a 09 50 52 45 44 49 43 41 54 45 3a 20 00 00 00  ..PREDICATE: ...
00000070: 0a 41 53 53 45 52 54 5f 45 58 50 52 73 20 74 6f  .ASSERT_EXPRs to
XFS (sda3): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x953140 len 32 error 117
XFS (sda3): xfs_imap_to_bp: xfs_trans_read_buf() returned error -117.
XFS (sda3): xfs_do_force_shutdown(0x8) called from line 3749 of file fs/xfs/xfs_inode.c. Return address = 7769d00a
XFS (sda3): Corruption of in-memory data detected.  Shutting down filesystem
XFS (sda3): Please unmount the filesystem and rectify the problem(s)
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000007

--- End code ---
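
A note on reading that dump: an on-disk XFS inode starts with the two-byte magic "IN" (0x494e), but the buffer above starts with plain text. The sketch below only decodes the 128 bytes quoted in the log and lists the NUL-separated strings in them; the helper names (parse_dump, c_strings) are made up for illustration and it is not part of any XFS tool.

```python
# Hex columns copied from the XFS report above (ASCII column dropped).
DUMP = """\
00000000: 6d 20 25 54 20 25 73 20 25 54 0a 00 46 6f 75 6e
00000010: 64 20 6e 65 77 20 72 61 6e 67 65 20 66 6f 72 20
00000020: 00 00 00 00 41 73 73 65 72 74 69 6f 6e 73 20 74
00000030: 6f 20 62 65 20 69 6e 73 65 72 74 65 64 20 66 6f
00000040: 72 20 00 00 0a 09 42 42 20 23 25 64 00 00 00 00
00000050: 0a 09 45 44 47 45 20 25 64 2d 3e 25 64 00 00 00
00000060: 0a 09 50 52 45 44 49 43 41 54 45 3a 20 00 00 00
00000070: 0a 41 53 53 45 52 54 5f 45 58 50 52 73 20 74 6f
"""

XFS_DINODE_MAGIC = b"IN"  # 0x494e, first two bytes of a valid on-disk inode


def parse_dump(text):
    """Turn 'offset: hh hh ...' lines into raw bytes (hypothetical helper)."""
    out = bytearray()
    for line in text.splitlines():
        _, _, hexpart = line.partition(":")
        out += bytes.fromhex(hexpart)
    return bytes(out)


def c_strings(buf):
    """Return the non-empty NUL-separated chunks, decoded as ASCII."""
    return [chunk.decode("ascii") for chunk in buf.split(b"\x00") if chunk]


buf = parse_dump(DUMP)
print("bytes decoded:", len(buf))
print("starts with inode magic:", buf.startswith(XFS_DINODE_MAGIC))
for s in c_strings(buf):
    print(repr(s))
```

The strings read like printf-style format strings from some program's data, which would be consistent with file or swapped-out page data landing where inode metadata was expected. That is only a reading of the dump, not a diagnosis.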

I have exhaustively tested
- the RAM: all tests passed
- the hard drive (four different models tested): all tests passed with badblocks and SMART
- the operating temperature: 22C, thanks to a big heat-sink + fan cooling

So I'm running out of ideas :-//

The burn-in test consists of
- 8GB of swap activity, cycling virtual memory up and down over 512MB of physical RAM
- 90% of CPU activity, but only 2 of 4 cores used
- the FPU is not used
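
One way a swap-heavy load like the one listed above could also check for silent corruption is to fill memory with a per-page pattern and verify it later. This is a minimal sketch, not the actual burn-in program: the 4096-byte page size, the SHA-256 based pattern and the function names are all assumptions made for illustration.

```python
import hashlib

PAGE = 4096  # assumed page size; check the real one on the target


def page_pattern(index, seed=0):
    """Deterministic PAGE-sized pattern that differs for every page index."""
    block = hashlib.sha256(f"{seed}:{index}".encode()).digest()  # 32 bytes
    return block * (PAGE // len(block))


def fill(buf, seed=0):
    """Write the expected pattern into every whole page of buf."""
    for i in range(len(buf) // PAGE):
        buf[i * PAGE:(i + 1) * PAGE] = page_pattern(i, seed)


def verify(buf, seed=0):
    """Return the indices of pages whose content no longer matches."""
    bad = []
    for i in range(len(buf) // PAGE):
        if buf[i * PAGE:(i + 1) * PAGE] != page_pattern(i, seed):
            bad.append(i)
    return bad


if __name__ == "__main__":
    # Tiny demo size; on the target the buffer would have to exceed
    # physical RAM so that pages are actually pushed out to swap.
    buf = bytearray(16 * PAGE)
    fill(buf)
    print("corrupted pages:", verify(buf))
```

Because every page carries its own index in the pattern, a failed check shows both which page changed and, by inspecting its content, whether it holds another page's pattern or unrelated data.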


The system appears to work "ok" for 12h, but within 48h it crashes. The problem is what to investigate, and how.

(edit: I seriously don't recommend Allwinner's stuff)

DiTBho:
It happens (tested) with
- ext4
- ext2
- xfs-v4
- btrfs

So I can rule out any filesystem bug; rather, it smells like something is wrong with swap, or with the EHCI-bulk module. The storage is a USB-to-SATA hard drive connected to the first EHCI controller.

DiTBho:
It is catastrophic because it severely damages the file system to the point that the partition must be wiped, formatted and restored from backup.

Marco:
Could the power be dipping?

DiTBho:

--- Quote from: Marco on October 01, 2021, 11:35:13 pm ---Could the power be dipping?

--- End quote ---

The PSU I am using is a professional laboratory power supply manufactured by Rigol.
The SoM module eats << 400mA @ 5V; the max peak is 600mA, and the current is limited to 1A.

The hard drive is powered by the second channel of my PSU. Its max peak is 1.1A @ 5V (never measured, it's what Seagate says; I actually see less current drawn), and the current is limited to 2A.


What perplexes me: I have enabled all the kernel "debugging" and "verbose" support (even the memory leak checkers and SCSI verbose logs), and I can successfully run exhaustive read/write tests that continuously move 50Mbyte/sec for 96 hours without a single sign of a crash.

I can also mount 400Mbyte as a ramdisk and successfully run exhaustive read/write tests that continuously move 300Mbyte/sec for 96 hours without a single sign of a crash.
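
For scale, the two 96-hour runs quoted above can be turned into total data moved. The short calculation below assumes decimal units (1 Mbyte = 10^6 bytes, 1 TB = 10^12 bytes) and a constant rate for the whole run; the function name is made up for illustration.

```python
SECONDS_PER_HOUR = 3600


def total_terabytes(mbyte_per_sec, hours):
    """Total data moved in TB, assuming a constant rate and decimal units."""
    total_mbyte = mbyte_per_sec * hours * SECONDS_PER_HOUR
    return total_mbyte / 1_000_000  # 10^6 Mbyte per TB


disk_tb = total_terabytes(50, 96)      # USB disk read/write test
ramdisk_tb = total_terabytes(300, 96)  # ramdisk read/write test

print(f"disk test:    {disk_tb:.2f} TB")     # 17.28 TB
print(f"ramdisk test: {ramdisk_tb:.2f} TB")  # 103.68 TB
```

So roughly 17 TB went through the USB storage path and about 104 TB through RAM without an error, while the swap-based burn-in fails within 48h.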



