Upstream Kernel Panics
s
Hi, I'm aware this isn't an officially supported build, but I created an RK1 Debian build based on https://gitlab.collabora.com/hardware-enablement/rockchip-3588 on both kernels 6.8 and 6.9-rc1. They install and run, and I'm able to run an Ubuntu 22 LXC container from an lvm-thin volume on my SSD. The bug seems to be triggered by disk writes: I was building an OpenWrt 23.05.3 image utilizing all 8 cores with 16GB RAM, and about 20 minutes in I was hit with a panic. It's reproducible, happens every time, and happened on multiple different chips, so I don't think it's hardware related. I've attached the crash log. Pinging @Spooky and @CFSworks for visibility https://cdn.discordapp.com/attachments/1223055516104659026/1223055516666957955/message.txt?ex=66187636&is=66060136&hm=84f378bf34b28ef383e5f19a68ceba69dfe42bacc8e7adca829ac398ca92959e&
I can provide the exact image if needed as well
c
This really looks like some kind of memory corruption in the inode cache.
s
i could try kernel 6.7
could it be a clock thing?
c
It happens way too consistently in the inode cache for me to suspect clocks or other memory/hardware management. A memory test is never a bad idea, but this feels like the kernel is doing some kind of memory unsafety thing.
s
wouldn't it be not unique to a particular device then?
e.g. if i build the same image for x86 i should see it there
i feel like someone wouldve reported it already if so
c
Hard to say. I remember hearing about an issue like this that was caused by invoking UB and the compiler for only a particular architecture was producing errant code. But you're right that this should be reported elsewhere, because it's not like AArch64 is a "niche" platform.
I'd see how consistent the
[ 3392.667905] Unable to handle kernel paging request at virtual address 0000018001ffff78
is. What's interesting about that to me is:
@ rasm2 -d -b64 -aarm -e 'f94002d3 b4000173 d1036273 b4000133 f9402263'
ldr x19, [x22]
cbz x19, 0x30
sub x19, x19, 0xd8
cbz x19, 0x30
ldr x3, [x19, 0x40]
@ hex(0x18000000010-0xd8+0x40)
'0x17fffffff78'
i.e. what's being loaded from x22 is
0x18000000010
which looks more like a flags field to me, and then pointer arithmetic is being done on that
Oh wait, messed up my arithmetic, sec
@ hex(0x18002000010-0xd8+0x40)
'0x18001ffff78'
But still, an integer value with only 4 bits sparsely set screams "bitfield" to me.
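Double-checking that corrected arithmetic in the shell (constants taken from the disassembly and the fault address above):

```shell
# x19 = [x22] - 0xd8, then the faulting load is [x19 + 0x40]:
printf '0x%x\n' $((0x18002000010 - 0xd8 + 0x40))
# matches the faulting virtual address 0x18001ffff78 from the oops
```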
s
hmmmm
c
well that makes it hard to see what's going on with the struct management. Maybe you just got very (un)lucky (depending on whether you consider being the first to discover a memory error "lucky") with the layout randomization. Do you happen to have
CONFIG_RANDSTRUCT_*
set?
s
lemme look
$ zcat /proc/config.gz | grep CONFIG_RANDSTRUCT
CONFIG_RANDSTRUCT_NONE=y
yep
c
"None" would mean the randomization is disabled, so that's good for debugging (but there goes my previous guess)
s
oh
wouldve helped if i read it lmao
wonder if i scuffed something with my kernel config
theres the whole kernel config
c
Ah, that you're on such an "atypical" config that the bug is happening?
s
this defconfig
added
CONFIG_DM_THIN_PROVISIONING=m
to get lvm-thin
then
ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- make olddefconfig
ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- make bindeb-pkg -j $(nproc)
c
I've never used it and I'm taking a shot in the dark, but try enabling
CONFIG_KASAN
and rebuilding the kernel. There'll likely be a performance cost to this, but with any luck it'll identify exactly where the corruption is happening.
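For reference, the relevant config fragment is roughly this (generic KASAN; the exact option set varies by kernel version):

```
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
```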
s
ok, gimme an hour or so to build and install
built, installing and triggering bug
c
Here's hoping it isn't a heisenbug
s
ok apparently i bricked it
c
Won't boot at all?
s
stuck in initramfs
c
Unable to load kernel modules due to a version skew? Or is there a KASAN problem flagged as early as initramfs?
s
not sure, lemme reinstall
almost done
alright building openwrt...
i lied, just starting the build now
mkay it wasnt even started and it crashed
does this help @CFSworks
c
It's going to take some studying to understand this. It doesn't look like KASAN was triggered but rather that KASAN's memory-tracking code caused a crash earlier in the execution.
s
🙂
thats not epic
c
I do think this crash is related to the same memory corruption, and earlier is always better since it's closer to the culprit.
s
anything i can provide or attempt or change?
this is a bit outside of my forte
c
I always feel like I'm flying by the seat of my pants with this kind of debugging too. I've never seen the same kind of problem twice.
s
i sent a message to the collabora guys too, but im doubtful they will reply
c
Here's an odd idea, but what about disabling cores 1-7 and only running core 0?
Adding
maxcpus=1
to the command line on boot ought to achieve that.
If the issue goes away entirely, then we know it's a data race. If it doesn't go away, it should get easier to debug. I'm wondering if this is corruption in "shared" caches and another core happens to trip over the corruption before the culprit thread is caught.
s
maxcpus=1 in uboot?
c
In the kernel
bootargs
, but yes set from U-Boot
s
i can limit it in lxc too
c
As long as it appears in
cat /proc/cmdline
, it should be good.
s
and pin it to one core
prob preferred in the kernel tho
c
The cmdline option should prevent the other cores from being powered up at all. But a good second test might be to use core affinity to achieve the same thing. 🤔
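A quick sketch of the affinity variant (taskset is from util-linux; the pinned command here is just a stand-in for the real workload):

```shell
# Pin a command to core 0 only; replace the echo with the actual workload,
# e.g. the OpenWrt build or the file-copy repro.
taskset -c 0 sh -c 'echo pinned-ok'
```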
s
k hang on
KASAN makes the kernel take 5ever to start lol
c
Yeah, and it's gonna be a lot worse when on a single core.
s
this uboot is weird
setenv bootargs 'maxcpus=1'
?
c
That should be it, but does your /boot have a script that overrides the bootargs?
s
probably
cause it didnt work lol
either that or something in the bootcmd overwrites it
actually, theres the initrd, sysmap, kernel and the extboot config
c
extboot config might be interesting to look at. The others aren't part of that part of the boot chain.
But yes, also studying the bootcmd sounds good.
s
extlinux* sorry
bootcmd=bootflow scan -lb
bootflow is new to me
oh hang on theres a boot script
scriptaddr=0x00c00000
c
extlinux/extlinux.conf would be the next file that bootflow looks at
s
it just looks like a grub config
bootmenu
## /boot/extlinux/extlinux.conf
##
## IMPORTANT WARNING
##
## The configuration of this file is generated automatically.
## Do not edit this file manually, use: u-boot-update

default l0
menu title U-Boot menu
prompt 0
timeout 50


label l0
    menu label Debian GNU/Linux 12 (bookworm) 6.8.0-g235e32bb9813-dirty
    linux /boot/vmlinuz-6.8.0-g235e32bb9813-dirty
    initrd /boot/initrd.img-6.8.0-g235e32bb9813-dirty
    fdtdir /usr/lib/linux-image-6.8.0-g235e32bb9813-dirty/
    
    append root=UUID=afb0e1eb-b0ad-4b79-b9a2-8354818b3b63 rootwait

label l0r
    menu label Debian GNU/Linux 12 (bookworm) 6.8.0-g235e32bb9813-dirty (rescue target)
    linux /boot/vmlinuz-6.8.0-g235e32bb9813-dirty
    initrd /boot/initrd.img-6.8.0-g235e32bb9813-dirty
    fdtdir /usr/lib/linux-image-6.8.0-g235e32bb9813-dirty/
    append root=UUID=afb0e1eb-b0ad-4b79-b9a2-8354818b3b63 rootwait single
(6.8.0 on my other node)
prob the append line
c
I'd disregard the warning and add
maxcpus=1
to the
append
for now
s
yep
hang on
alright were online with one core
core 0
c
Have you also double-checked with
/proc/cpuinfo
?
s
$ cat /proc/cpuinfo
processor    : 0
BogoMIPS    : 48.00
Features    : fp asimd evtstrm crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer    : 0x41
CPU architecture: 8
CPU variant    : 0x2
CPU part    : 0xd05
CPU revision    : 0
yep
c
Cool. If nothing else I hope this makes the oops/panic output more legible
s
this may take a while, so im going to monitor the chip's uart from a tmux session
might be asleep by the time it crashes (if it does)
c
👍
s
appreciate the debugging help 🙂
c
I'm in a kernel debugging mood this week anyway
I'm multitasking this and also tracking down a crash in the experimental NVK open-source NVIDIA Vulkan driver at the same time
s
oh ive heard of that driver
my buddy was telling me about it
c
It's definitely not ready to be a "daily driver" but I'm impressed with how well it works already.
s
yea thats what im hearing
i heard the intel gpu drivers are actually decent too
c
Those very much are, they're my primary driver.
s
transcode capabilities are awesome too
other node just crashed doing a tarball extract
c
Other node meaning one with all 8 cores enabled?
s
yea
single core one is still churning
c
If the single core one doesn't die, I might start to suspect a cache coherency issue. The one big controversy with the RK3588 is it doesn't implement cache snooping on its interconnect (no idea if that includes caches within a single core cluster, or not). If this doesn't end in a crash, I'm wondering if there's some cache management issue that's masked by all of the platforms that do implement snooping.
(And the cache management issue is apparently unique to the ext4 code.)
s
sounds very suspect the way you describe it
theres often cma warnings too
unsure if theyre related
I may have just reset the wrong machine… let me re-run everything 😰
so i think it still crashed but i didnt see anything in the logs
im going to try again
a day later, i am positive it still crashes even on one core
@CFSworks 🙂
c
Definitely a kernel bug then. What's killing it, I have no idea. Seems like the first fault was in filesystem code though... the consistency of this happening in the filesystem doesn't seem like a coincidence.
s
Agreed
Maybe I should try 6.7
c
If you can find a version where it doesn't happen, you could do a git-bisect
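git-bisect can even drive the test automatically via `git bisect run`; here's a toy, self-contained demo in a throwaway repo where the "bug" appears at commit 4 (all names and the pass/fail script are made up for illustration):

```shell
#!/bin/sh
# Toy git-bisect demo: 5 commits, the "bug" appears at commit 4.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
for i in 1 2 3 4 5; do
    echo "$i" > n
    git add n
    git -c user.name=demo -c user.email=demo@example.com commit -qm "commit $i"
done
# HEAD (commit 5) is known bad, HEAD~4 (commit 1) is known good.
git bisect start HEAD HEAD~4 >/dev/null
# The run script exits 0 (good) while the bug is absent, nonzero (bad) once
# present; in the real case it would build the kernel and run the workload.
git bisect run sh -c 'test "$(cat n)" -lt 4' >/dev/null
git show -s --format=%s refs/bisect/bad   # prints the first bad commit's subject
```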
s
Problem is I can only go back so far for this board
6.6 was LTS yeah?
c
It was; you'll want to keep the .dtb from your latest build though, since the RK1 .dts landed in 6.7.
s
it might be related to lvm actually
c
Changing to a different fs but keeping lvm sounds like a good test
s
First I’m going to try in vanilla Debian w/o proxmox
so far so good on 8c with vanilla debian (no proxmox).
ok, crashed on vanilla debian, so it is indeed a kernel issue
alright ill build 6.7 now and try that
alright, openwrt compiling on 6.7. lets see if it crashes
crashed on 6.7 too
there has to be something else wrong, there's no way a basic ext4 system would be crashing for 3 minor kernel versions
@CFSworks sure it’s nothing in the device tree?
It really almost seems like a clock now
c
I haven't encountered anything on my end, but that doesn't mean it's definitely error-free. What makes it seem like a clock?
s
I read something on the collabora git about certain clocks being unstable
Let me see if I can find it
It wouldn’t be a missing kernel module right? Since it technically “works”
c
It doesn't make sense for it to be a missing kernel module, no.
s
I can’t find that clk reference
Maybe I should try building the vanilla Linux kernel?
And not from collabora
Maybe they introduced a buggy patch
Oh here’s one
That’s for rock-5b though
Ok lemme try vanilla kernel
Else, I’m out of ideas
c
Also run a test with a non-ext4 filesystem. If the error still happens with, I dunno, xfs, then we know it's not ext4-related.
s
ok still crashed on pure vanilla kernel 6.9-rc2 tag
guess i need to try btrfs or xfs
sounds like a tomorrow problem
i think i need to re-build u-boot with btrfs support
interesting data point
i tried to dd an image from /tmp on my nvme to mmcblk0 and it panicked
[ 1620.511576] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 1620.520153] CPU: 6 PID: 1 Comm: systemd Not tainted 6.9.0-rc2 #1
[ 1620.526868] Hardware name: Turing Machines RK1 (DT)
[ 1620.532315] Call trace:
[ 1620.535042]  dump_backtrace+0x94/0xec
[ 1620.539144]  show_stack+0x18/0x24
[ 1620.542848]  dump_stack_lvl+0x38/0x90
[ 1620.546944]  dump_stack+0x18/0x24
[ 1620.550648]  panic+0x39c/0x3d0
[ 1620.554054]  do_exit+0x834/0x92c
[ 1620.557659]  do_group_exit+0x34/0x90
[ 1620.561650]  copy_siginfo_to_user+0x0/0xc8
[ 1620.566227]  do_signal+0x118/0x1378
[ 1620.570126]  do_notify_resume+0xc8/0x140
[ 1620.574508]  el0_undef+0x84/0x98
[ 1620.578113]  el0t_64_sync_handler+0xa0/0x12c
[ 1620.582884]  el0t_64_sync+0x190/0x194
[ 1620.586974] SMP: stopping secondary CPUs
[ 1620.591452] Kernel Offset: disabled
[ 1620.595344] CPU features: 0x4,00000003,80140528,4200720b
[ 1620.601280] Memory Limit: none
[ 1620.604690] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
alright, were on btrfs. lets see
@CFSworks btrfs crashed
this was on eMMC and not the nvme (ext crashes were all on nvme) so its not the drives/storage either
as proof
/dev/mmcblk0p3 on / type btrfs (rw,relatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)
c
A btrfs crash means it's more likely to be hardware related, yeah. The only patch of mine that I can think related is already upstream, though. This is a tricky problem.
s
but its crashed on 3/4 of my devices
seems less likely hardware, no?
actually no, its crashed on all 4
one or two also on emmc
c
Oh oops I guess I meant "driver related"
s
what driver(s) are you thinking?
probably not related to pcie
not mmc
c
Could be eMMC, but I have no idea. Does this happen consistently whether you have NVMes installed or not?
s
I haven't tried removing nvme drives yet, but I have left 2 nodes on (running from nvme) with low IO and they've been up for 2-3 days straight now
I sent an email to collabora requesting a little assistance
s
Sorry for the delay in chiming in, ubuntu has kept me busy with the beta launch coming up. Do we have any theory on the issue?
s
Nope
I’ve tested just about everything I can test
Same behavior on all my nodes so it’s not hardware
Tried with one single core enabled
Tried on ext4 and btrfs
Tried collabora kernels 6.7, 6.8 and 6.9 as well as upstream 6.9-rc2
s
Hmm i have a 6.7 build, would you be able to see if it also crashes?
s
I did try on 6.7
s
Its not collabora tho
s
I’m willing to give it a shot though
s
Its more of an Armbian + my stuff kernel
s
Yeah drop it here I’ll try it
s
Oh geez its 3 months old
I keep getting pulled into the BSP kernel trap
c
What are the common elements of all of the crashes so far?
- On an RK1
- ~~Compiling OpenWrt~~ heavy filesystem access ~~to eMMC~~ to any external (network/block) target
- ext4 filesystem
- LVM
- Proxmox, not bare metal
- Multiple CPU cores enabled
s
Compiling openwrt is just one way to crash it. I’ve been able to crash when copying a large file from an NFS share to another
As well as dd’ing a file from /tmp to mmcblk0
s
I have about 20 different rk3588 boards btw, if i can get the exact reproducible steps i can rule out that its RK1 specific
s
Screw the BSP kernel lol
s
6.1 is not that bad, but 5.10 was CURSED
c
Can we eliminate the eMMC as the culprit, by doing heavy filesystem access to the NVMe instead?
s
@Spooky try to mount an NFS share or two and copy a large file between (like 20g+)
I’ve tested on both eMMC and NVME. Same result
c
Does NFS->NFS trigger it?
s
Yes
Here’s my build
c
How about mounting tmpfs and repeatedly doing
dd if=/dev/urandom of=/tmp/test.bin bs=4096 count=1024
in a tight loop?
s
Hmmm i need to power on my nas, it has an NFS server with a bunch of ubuntu build artifacts i can try transferring
c
Can you trigger the crash by running
iperf3
or other non-filesystem I/O?
s
let me try iperf
s
Off topic, but with Panthor being merged in 6.10 we will finally have gpu support. ill likely send a patch for HDMI, but the edid quirk may take some time to figure out a proper patch.
s
im stoked for this
will try this after
s
Ive been talking to some of the GPU / VPU devs, they are insanely smart. I believe AV1 support should be coming in soon.
s
also stoked for that, my whole media library is in av1
what exactly do you mean "mounting tmpfs" ?
s
Personally im waiting for 6.10 before doing any rebase on mainline. But im flashing a fresh 6.7 image now to test.
c
tmpfs is just an in-RAM filesystem.
/tmp
is typically one.
mount | grep /tmp
to check its type
(Sometimes it's not tmpfs, but a directory in
/
that gets cleared on boot)
s
udev on /dev type devtmpfs (rw,nosuid,relatime,size=16245064k,nr_inodes=4061266,mode=755)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=3254680k,mode=755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
tmpfs on /run/user/1001 type tmpfs (rw,nosuid,nodev,relatime,size=3254676k,nr_inodes=813669,mode=700,uid=1001,gid=1001)
i actually dont understand this one
$ df /tmp
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/nvme0n1p3  32937936 3895364  27348492  13% /
c
So /run and /dev/shm are tmpfs, but it looks like /tmp is on the NVMe
s
it is
it shouldn't be
c
Sometimes that's done just to give /tmp more space than just RAM can handle
s
i think 32g is enough lol
c
Well,
mount -t tmpfs /tmp /tmp
if you'd like to use RAM instead
s
iperf for 5 mins at 1gbit didnt crash it
lemme write to /tmp (in ram)
for i in $(seq 1 1024); do dd if=/dev/urandom of=/tmp/test.bin bs=4096 count=20480; done
seems to be okay so far
aha!
seems to be reading from disk
or reading in general
not writing
dd if=/dev/urandom of=/tmp/test.bin bs=4096 count=2048000 status=progress
then copy this somewhere else, e.g. to an nfs share
cp /tmp/test.bin /mnt/share/test.bin
crashed instantly
logs from this one
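Scaled down to its shape, the repro is just write-then-read-back; a parameterized sketch (paths and sizes here are placeholders, the real destination was an NFS mount and the file was ~8GB):

```shell
#!/bin/sh
# Write pseudo-random data to one file, copy it elsewhere, and verify the
# copy. Sizes are tiny here purely for illustration.
repro() {
    src=$1 dst=$2 blocks=$3
    dd if=/dev/urandom of="$src" bs=4096 count="$blocks" 2>/dev/null
    cp "$src" "$dst"
    cmp -s "$src" "$dst" && echo copy-ok
}
repro "$(mktemp)" "$(mktemp)" 16
```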
c
I wonder if you should test a lower patchlevel of each kernel minor you're testing
It occurs to me that there might be a bad fix commit that got cherrypicked onto each of the 6.6-8 branches
Maybe try out 6.7.1 first and go from there?
s
yep that is 100% it, just crashed again immediately
lemme check out 6.7.1
6.7.1 crashes
im trying @Spooky 's 6.7 ubuntu build now...
ok, @Spooky your ubuntu 22 build doesnt crash
so something is b0rked with my build 🙃
i installed @Spooky 's 6.7 kernel onto my debian 12 machine. still crashed when copying a file from an nfs share back to itself
[  564.470798] Unable to handle kernel paging request at virtual address 0068e851806854c4
[  564.479679] Mem abort info:
[  564.481027] Unable to handle kernel paging request at virtual address 0058db8f59c78b97
[  564.482787]   ESR = 0x0000000096000004
[  564.482790]   EC = 0x25: DABT (current EL), IL = 32 bits
[  564.482793]   SET = 0, FnV = 0
[  564.482794]   EA = 0, S1PTW = 0
[  564.482796]   FSC = 0x04: level 0 translation fault
[  564.482798] Data abort info:
[  564.482800]   ISV = 0, ISS= 0x00000004, ISS2 = 0x00000000
[  564.482802]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  564.482804]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  564.482806] [0068e851806854c4] address between user and kernel address ranges
[  564.482809] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  564.482812] Modules linked in: ip6table_filter ip6_tables iptable_filter bridge stp llc cfg80211 rfkill crct10dif_ce rk805_pwrkey hantro_vpu pwm_fan v4l2_vp9 v4l2_h264 v4l2_mem2mem videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 rockchip_thermal videodev videobuf2_common mc fuse ip_tables x_tables ipv6 dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison libcrc32c dm_mod dwmac_rk stmmac_platform stmmac rtc_hym8563 phy_rockchip_naneng_combphy pcs_xpcs rockchipdrm nvme analogix_dp dw_hdmi cec dw_hdmi_qp dw_mipi_dsi drm_display_helper drm_dma_helper drm_kms_helper nvme_core drm backlight
[  564.482878] CPU: 4 PID: 277 Comm: systemd-journal Not tainted 6.9.0-rc1-g99fc9cef1176-dirty #1
[  564.482882] Hardware name: Turing Machines RK1 (DT)
[  564.482884] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  564.482889] pc : __d_lookup_rcu+0x4c/0xf8
[  564.482901] lr : lookup_fast+0x34/0x144
[  564.482908] sp : ffff800084343a90
[  564.482909] x29: ffff800084343a90 x28: ffff800084343c80 x27: 0000000000000000
[  564.482914] x26: 2f2f2f2f2f2f2f2f x25: d0d0d0d0d0d0d0d0 x24: ffff800084343c80
[  564.482919] x23: fefefefefefefeff x22: ffff0001068a8026 x21: ffff00010777d000
[  564.482923] x20: ffff800084343c80 x19: ffff800084343c80 x18: 0000000000000000
[  564.482927] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  564.482931] x14: 0000000000000000 x13: ffff0001068a8021 x12: ffff800084343cc4
[  564.482936] x11: 000000046a4d2343 x10: 000ffffffffffff8 x9 : 0000000000000004
[  564.482940] x8 : ffff00010777d000 x7 : e6e8e2ffb3bfa2a0 x6 : 0000000000210000
[  564.482945] x5 : ffff800081b72000 x4 : 00000000001a9348 x3 : ffff0007da200000
[  564.482949] x2 : 0000000000000004 x1 : 0268e851806854c8 x0 : 0268e851806854c8
[  564.482953] Call trace:
[  564.482955]  __d_lookup_rcu+0x4c/0xf8
[  564.482961]  walk_component+0x28/0x190
[  564.482964]  link_path_walk.part.0.constprop.0+0x294/0x394
[  564.482968]  path_openat+0xa8/0xef4
[  564.482971]  do_filp_open+0x9c/0x14c
[  564.482974]  do_sys_openat2+0xc0/0xf4
[  564.482979]  __arm64_sys_openat+0x64/0xa4
[  564.482983]  invoke_syscall+0x48/0x114
[  564.482989]  el0_svc_common.con
but it didnt crash on your ubuntu 22
c
Different set of loaded modules, perhaps?
s
oops, that was on 6.9.0
forgot to uninstall
this is the 6.7.0 crash
[  460.507171] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[  460.515737] CPU: 6 PID: 1 Comm: systemd Not tainted 6.7.0+ #1
[  460.522153] Hardware name: Turing Machines RK1 (DT)
[  460.527597] Call trace:
[  460.530321]  dump_backtrace+0x98/0x118
[  460.534506]  show_stack+0x18/0x24
[  460.538202]  dump_stack_lvl+0x74/0xc0
[  460.542289]  dump_stack+0x18/0x24
[  460.545984]  panic+0x3b4/0x3f0
[  460.549384]  do_exit+0x8cc/0x9b8
[  460.552984]  do_group_exit+0x34/0x90
[  460.556972]  get_signal+0x954/0x97c
[  460.560862]  do_notify_resume+0x298/0x1400
[  460.565432]  el0_da+0x8c/0x90
[  460.568733]  el0t_64_sync_handler+0xb8/0x12c
[  460.573489]  el0t_64_sync+0x1a4/0x1a8
[  460.577575] SMP: stopping secondary CPUs
[  460.582047] Kernel Offset: disabled
[  460.585936] CPU features: 0x1,80000000,70028146,2100720b
[  460.591865] Memory Limit: none
[  460.595271] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
so the bootloaders differ, but i can't imagine that would be the cause... right?
c
Bootloader contains the DRAM initialization/training code, and a bug in that would affect RAM stability. Did you already try running
memtester
just to eliminate RAM problems as the culprit?
s
ooo
s
Nope. Can do
s
Ive been dealing with this stuff pretty much all day
s
Oh?
s
I have a new rk3588 board i can't name, but they upgraded the ram to ddr5. It requires the latest SPL and DDR blobs, it may be worth a shot to rebuild u-boot with the updated blobs, let me grab a link to em
Updating the blobs also helps ddr4 boards with stability, i forget where the changelogs are
s
well well well
i think it was that
yep
ahhh its finally stable 🥹
s
Awsome news!
I wish those blobs were open source, I'd love to investigate the ram training process in depth
s
Same
My build did use those new experimental open source ones
s
Ahhh that is probably why there were issues
s
im back 🙂
i made a change on all 4 of my rk1s, the change was enabling the gpu in the device tree and enabling the fan curve. pretty basic. on 3 of my 4 devices its all great, but one of them is having trouble coming back up https://cdn.discordapp.com/attachments/1223055516104659026/1256343847332745246/message.txt?ex=66806ce2&is=667f1b62&hm=ad510774f39e4a9bdfcb7c3cf22f0a2fd8c50326acd659a9df2d013fa70d1eb4&
any hints debugging this one?
heres the patch
here's an associated panic
c
Just to confirm: that fourth RK1 starts behaving again if you rollback the change, right?
Also on the 3 that boot happily, are these lines present in the log?
[  140.323832] thermal_sys: Failed to bind 'package-thermal' with 'pwm-fan': -17
[  143.054645] rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
[  145.762181] rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0x1bffff
Since the traceback tells me that the hardware(?!) is balking at a register update while trying to power "something" on (and I strongly believe that "something" is the GPU)
s
thermal_sys is present in the log on the other 3
the latter 2 lines are not
checking on rolling back the kernel, i think i have it somewhere
have to boot into emmc and chroot to reinstall/revert kernel
oh i have 6.9 still installed
back to 6.9 (without the fan and gpu enabled) boots as expected
i suppose that would be without the panthor driver too
c
The whole panic seems to happen from here to line 555:
rockchip_do_pmu_set_power_domain
is trying to tell the PMU to provide power to the GPU, and timing out while waiting for the PMU to confirm that it's powered (this is the
failed to set domain 'gpu', val=0
)
Execution continues to
rockchip_pmu_set_idle_request
to try to take the GPU out of "idle mode" (I guess this supplies it with clocks?) but the PMU never acknowledges that request either (
failed to get ack on domain 'gpu', val=0x1bffff
), probably because the GPU was never actually powered up
Finally it gets to
rockchip_pmu_restore_qos
which tries to set some registers on the GPU('s bus controller) itself, but the bus sends back an error because the GPU isn't powered, and when the bus error gets back to the CPU it appears as a panic
So the real mystery is why the PMU isn't cooperating when the CPU asks to power up the GPU
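One way to peek at power-domain state from userspace, assuming the kernel has generic power domains and debugfs is mounted (the genpd summary file is standard kernel debugfs, not RK1-specific, and needs root):

```shell
# Dump generic power-domain status (including the 'gpu' domain on RK3588);
# falls back gracefully where debugfs isn't available, e.g. in a container.
f=/sys/kernel/debug/pm_genpd/pm_genpd_summary
if [ -r "$f" ]; then
    grep -i gpu "$f" || cat "$f"
else
    echo "pm_genpd summary not readable (mount debugfs / run as root)"
fi
```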
s
FYI the kernel that it panicked on was 6.10-rc1
Given that it’s only one one of my 4, it sorta sounds like a hardware issue
This one has also been having issues bringing up one of my hard drives
c
It does. I know the GPU power is actually supplied off-chip (it's one of the voltages supplied by the RK805) so I'm wondering if the RK3588 is just waiting indefinitely for that power to show up.
s
on the other 3:
cat /sys/firmware/devicetree/base/gpu\@fb000000/status
okay
ls /dev/dri/
by-path  card0  renderD128
c
This is the bit in the PMU that it times out waiting for. I don't know what "repair" is but it might actually require that power is arriving and not merely "enabled." https://cdn.discordapp.com/attachments/1223055516104659026/1256385003957522452/image.png?ex=66809337&is=667f41b7&hm=3293a2426eb30488f0d579a46d5246609f2b08ebec2badd0ccb4129566014b62&
s
@DhanOS (Daniel Kukiela) might have an idea?
it wouldnt be a slot issue on the tpi2, right?
i can try swapping them later
but wait, this makes no sense
because it worked fine on my 6.10 kernel with the gpu enabled until i installed the one with the pwm change
i had set the gpu to be enabled without touching the fan in my previous build
c
So you had a build that worked on that node?
s
Yeah but I tried reverting to it and it failed to boot with the same errors
Hm maybe I didn’t upgrade that one
That one has the majority of my VMs and containers
c
If you're feeling hardcore, here's a picture of the RK1 with the cooler removed and pins 49-51 of the RK806 (PMIC) circled in red: https://cdn.discordapp.com/attachments/1223055516104659026/1256393464682123345/image.png?ex=66809b18&is=667f4998&hm=d21d0b1422aa0d86125beb1944ba54a97d4a401fcb8e376b56e50dbbe3e373ce&
So if you probe pin 49 with a voltmeter (please be careful not to short pins 49-50, though 49-48 might be fine if shorted by your probe) you should see it holding steady at 550-950mV when the GPU power is enabled. If not, that's pretty clear evidence of a hardware issue.
s
Oof this one will be hard given it has to be in the tpi2
Would this have to be before it panics?
c
It probably stays enabled even after it panics. Though before doing that, maybe check that beefy inductor (L10) near that corner, since that looks like it's the main inductor for the GPU power converter. I'd first check it visually and then with a continuity tester, just to make sure the wire isn't broken.
s
Will do
i havent had a chance to check that hardware yet, but removing the gpu enable from the device tree allows it to boot
update: it seems like a hardware issue now. i just rebooted it after operating normally since the previous message, and its panicing on boot. nothing changed.
[    4.472806] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[    4.482717] Mem abort info:
[    4.485846]   ESR = 0x0000000096000004
[    4.490059]   EC = 0x25: DABT (current EL), IL = 32 bits
[    4.496022]   SET = 0, FnV = 0
[    4.499455]   EA = 0, S1PTW = 0
[    4.502976]   FSC = 0x04: level 0 translation fault
[    4.508451] Data abort info:
[    4.511688]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[    4.517843]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    4.523514]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    4.529476] [0000000000000008] user address but active_mm is swapper
[    4.536606] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[    4.543628] Modules linked in:
[    4.547054] CPU: 3 PID: 31 Comm: cpuhp/3 Not tainted 6.10.0-rc1+ #4
[    4.554076] Hardware name: Turing Machines RK1 (DT)
[    4.559529] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.567331] pc : blk_mq_hctx_notify_online+0x34/0xb0
[    4.572903] lr : cpuhp_invoke_callback+0x2c4/0x560
[    4.578278] sp : ffff80008249bd50
[    4.581990] x29: ffff80008249bd50 x28: ffff800081da9000 x27: 0000000000000000
[    4.589999] x26: 00000000000000ec x25: ffff0007fbf26c78 x24: 00000000000002f3
[    4.598006] x23: ffff8000807d07cc x22: ffff0007fbf26ca0 x21: ffff000101c03978
[    4.606012] x20: 0000000000000097 x19: ffff000101c03800 x18: ffff80008249bc78
[    4.614018] x17: 000000040044ffff x16: 005000f2b5503510 x15: 0000000000000000
[    4.622023] x14: ffff8000813b11a8 x13: ffffffffffffffff x12:000000034 x7 : ffff000100e8fe08 x6 : ffff000101aa5200
[    4.646021] x5 : ffff800081dad000 x4 : ffff0007fbf26ca0 x3 : 0000000000000003
[    4.654025] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000108496698
[    4.662031] Call trace:
[    4.66-
[   14.356782] platform a40000000.pcie: deferred probe pt: deferred probe pending: (reason unknown)
i just wrote a fresh image to emmc and it paniced. writing the same to another to see what happens
paniced? panicked? whatever lol
yeah booted fine. think this one is bricked
oooooooh now this is weird. i moved another rk1 into this one's slot and now it is panicing
ok i swapped this rk1 with another that was known working. the one previously panicking no longer panics on the other slot. the one previously working fine now panicks
@CFSworks could it be a slot issue?
before:
node1 - slot 1: good
node3 - slot 3: panic
after:
node1 - slot 3: panic
node3 - slot 1: good
ironically now slot2/node2 is panicking
c
Have you tried moving around whatever is in the M.2 slots? This might be the PCIe CLKREQ# issue manifesting yet again.
s
all m.2 slots have nvme drives. i did try emmc booting though, so wouldnt that negate that?
emmc still panicked
hmm actually i wonder
** File not found ubootefi.var **
Failed to load EFI variables
** Unable to write file ubootefi.var **
Failed to persist EFI variables
** Unable to write file ubootefi.var **
Failed to persist EFI variables
** Unable to write file ubootefi.var **
Failed to persist EFI variables
  0  efi_mgr      ready   (none)       0  <NULL>                   
** Booting bootflow '<NULL>' with efi_mgr
Loading Boot0000 'mmc 0' failed
EFI boot manager: Cannot load any image
Boot failed (err=-14)
pcie_dw_rockchip pcie@fe180000: PCIe-0 Link Fail
** Unable to write file ubootefi.var **
Failed to persist EFI variables
k#1.bootdev.part_3' with extlinux
on a side note, this was a very good test of my backup strategy lmao.