Upstream Kernel Panics
s
Hi, I'm aware this isn't an officially supported build, but I created an RK1 Debian build based on https://gitlab.collabora.com/hardware-enablement/rockchip-3588 on both kernels 6.8 and 6.9-rc1. They install and run, and I'm able to run an Ubuntu 22 LXC container from an lvm-thin volume on my SSD. The bug seems to be triggered by disk writes: I was building an OpenWrt 23.05.3 image utilizing all 8 cores with 16GB RAM, and about 20 minutes in I was hit with a panic. It's reproducible, happens every time, and happened on multiple different chips, so I don't think it's hardware related. I've attached the crash log. Pinging @Spooky and @CFSworks for visibility https://cdn.discordapp.com/attachments/1223055516104659026/1223055516666957955/message.txt?ex=66187636&is=66060136&hm=84f378bf34b28ef383e5f19a68ceba69dfe42bacc8e7adca829ac398ca92959e&
I can provide the exact image if needed as well
c
This really looks like some kind of memory corruption in the inode cache.
s
i could try kernel 6.7
could it be a clock thing?
c
It happens way too consistently in the inode cache for me to suspect clocks or other memory/hardware management. A memory test is never a bad idea, but this feels like the kernel is doing some kind of memory unsafety thing.
s
wouldn't it be not unique to a particular device then?
e.g. if i build the same image for x86 i should see it there
i feel like someone wouldve reported it already if so
c
Hard to say. I remember hearing about an issue like this that was caused by invoking UB and the compiler for only a particular architecture was producing errant code. But you're right that this should be reported elsewhere, because it's not like AArch64 is a "niche" platform.
I'd see how consistent the
[ 3392.667905] Unable to handle kernel paging request at virtual address 0000018001ffff78
is. What's interesting about that to me is:
@ rasm2 -d -b64 -aarm -e 'f94002d3 b4000173 d1036273 b4000133 f9402263'
ldr x19, [x22]
cbz x19, 0x30
sub x19, x19, 0xd8
cbz x19, 0x30
ldr x3, [x19, 0x40]
@ hex(0x18000000010-0xd8+0x40)
'0x17fffffff78'
i.e. what's being loaded from x22 is
0x18000000010
which looks more like a flags field to me, and then pointer arithmetic is being done on that
Oh wait, messed up my arithmetic, sec
@ hex(0x18002000010-0xd8+0x40)
'0x18001ffff78'
But still, an integer value with only 4 bits sparsely set screams "bitfield" to me.
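Double-checking that corrected arithmetic in the shell (constants taken from the disassembly and the fault address above):

```shell
# x19 = [x22] - 0xd8, then the faulting load is [x19 + 0x40]:
printf '0x%x\n' $((0x18002000010 - 0xd8 + 0x40))
# matches the faulting virtual address 0x18001ffff78 from the oops
```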
s
hmmmm
c
well that makes it hard to see what's going on with the struct management. Maybe you just got very (un)lucky (depending on whether you consider being the first to discover a memory error "lucky") with the layout randomization. Do you happen to have
CONFIG_RANDSTRUCT_*
set?
s
lemme look
$ zcat /proc/config.gz | grep CONFIG_RANDSTRUCT
CONFIG_RANDSTRUCT_NONE=y
yep
c
"None" would mean the randomization is disabled, so that's good for debugging (but there goes my previous guess)
s
oh
wouldve helped if i read it lmao
wonder if i scuffed something with my kernel config
theres the whole kernel config
c
Ah, that you're on such an "atypical" config that the bug is happening?
s
this defconfig
added
CONFIG_DM_THIN_PROVISIONING=m
to get lvm-thin
then
ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- make olddefconfig
ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- make bindeb-pkg -j $(nproc)
c
I've never used it and I'm taking a shot in the dark, but try enabling
CONFIG_KASAN
and rebuilding the kernel. There'll likely be a performance cost to this, but with any luck it'll identify exactly where the corruption is happening.
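For reference, the relevant config fragment is roughly this (generic KASAN; the exact option set varies by kernel version):

```
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
```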
s
ok, gimme an hour or so to build and install
built, installing and triggering bug
c
Here's hoping it isn't a heisenbug
s
ok apparently i bricked it
c
Won't boot at all?
s
stuck in initramfs
c
Unable to load kernel modules due to a version skew? Or is there a KASAN problem flagged as early as initramfs?
s
not sure, lemme reinstall
almost done
alright building openwrt...
i lied, just starting the build now
mkay it wasnt even started and it crashed
does this help @CFSworks
c
It's going to take some studying to understand this. It doesn't look like KASAN was triggered but rather that KASAN's memory-tracking code caused a crash earlier in the execution.
s
🙂
thats not epic
c
I do think this crash is related to the same memory corruption, and earlier is always better since it's closer to the culprit.
s
anything i can provide or attempt or change?
this is a bit outside of my forte
c
I always feel like I'm flying by the seat of my pants with this kind of debugging too. I've never seen the same kind of problem twice.
s
i sent a message to the collabora guys too, but im doubtful they will reply
c
Here's an odd idea, but what about disabling cores 1-7 and only running core 0?
Adding
maxcpus=1
to the command line on boot ought to achieve that.
If the issue goes away entirely, then we know it's a data race. If it doesn't go away, it should get easier to debug. I'm wondering if this is corruption in "shared" caches and another core happens to trip over the corruption before the culprit thread is caught.
s
maxcpus=1 in uboot?
c
In the kernel
bootargs
, but yes set from U-Boot
s
i can limit it in lxc too
c
As long as it appears in
cat /proc/cmdline
, it should be good.
s
and pin it to one core
prob preferred in the kernel tho
c
The cmdline option should prevent the other cores from being powered up at all. But a good second test might be to use core affinity to achieve the same thing. 🤔
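A quick sketch of the affinity variant (taskset is from util-linux; the pinned command here is just a stand-in for the real workload):

```shell
# Pin a command to core 0 only; replace the echo with the actual workload,
# e.g. the OpenWrt build or the file-copy repro.
taskset -c 0 sh -c 'echo pinned-ok'
```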
s
k hang on
KASAN makes the kernel take 5ever to start lol
c
Yeah, and it's gonna be a lot worse when on a single core.
s
this uboot is weird
setenv bootargs 'maxcpus=1'
?
c
That should be it, but does your /boot have a script that overrides the bootargs?
s
probably
cause it didnt work lol
either that or something in the bootcmd overwrites it
actually, theres the initrd, sysmap, kernel and the extboot config
c
extboot config might be interesting to look at. The others aren't part of that part of the boot chain.
But yes, also studying the bootcmd sounds good.
s
extlinux* sorry
bootcmd=bootflow scan -lb
bootflow is new to me
oh hang on theres a boot script
scriptaddr=0x00c00000
c
extlinux/extlinux.conf would be the next file that bootflow looks at
s
it just looks like a grub config
bootmenu
## /boot/extlinux/extlinux.conf
##
## IMPORTANT WARNING
##
## The configuration of this file is generated automatically.
## Do not edit this file manually, use: u-boot-update

default l0
menu title U-Boot menu
prompt 0
timeout 50


label l0
    menu label Debian GNU/Linux 12 (bookworm) 6.8.0-g235e32bb9813-dirty
    linux /boot/vmlinuz-6.8.0-g235e32bb9813-dirty
    initrd /boot/initrd.img-6.8.0-g235e32bb9813-dirty
    fdtdir /usr/lib/linux-image-6.8.0-g235e32bb9813-dirty/
    
    append root=UUID=afb0e1eb-b0ad-4b79-b9a2-8354818b3b63 rootwait

label l0r
    menu label Debian GNU/Linux 12 (bookworm) 6.8.0-g235e32bb9813-dirty (rescue target)
    linux /boot/vmlinuz-6.8.0-g235e32bb9813-dirty
    initrd /boot/initrd.img-6.8.0-g235e32bb9813-dirty
    fdtdir /usr/lib/linux-image-6.8.0-g235e32bb9813-dirty/
    append root=UUID=afb0e1eb-b0ad-4b79-b9a2-8354818b3b63 rootwait single
(6.8.0 on my other node)
prob the append line
c
I'd disregard the warning and add
maxcpus=1
to the
append
for now
s
yep
hang on
alright were online with one core
core 0
c
Have you also double-checked with
/proc/cpuinfo
?
s
$ cat /proc/cpuinfo
processor    : 0
BogoMIPS    : 48.00
Features    : fp asimd evtstrm crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
CPU implementer    : 0x41
CPU architecture: 8
CPU variant    : 0x2
CPU part    : 0xd05
CPU revision    : 0
yep
c
Cool. If nothing else I hope this makes the oops/panic output more legible
s
this may take a while, so im going to monitor the chip's uart from a tmux session
might be asleep by the time it crashes (if it does)
c
👍
s
appreciate the debugging help 🙂
c
I'm in a kernel debugging mood this week anyway
I'm multitasking this and also tracking down a crash in the experimental NVK open-source NVIDIA Vulkan driver at the same time
s
oh ive heard of that driver
my buddy was telling me about it
c
It's definitely not ready to be a "daily driver" but I'm impressed with how well it works already.
s
yea thats what im hearing
i heard the intel gpu drivers are actually decent too
c
Those very much are, they're my primary driver.
s
transcode capabilities are awesome too
other node just crashed doing a tarball extract
c
Other node meaning one with all 8 cores enabled?
s
yea
single core one is still churning
c
If the single core one doesn't die, I might start to suspect a cache coherency issue. The one big controversy with the RK3588 is it doesn't implement cache snooping on its interconnect (no idea if that includes caches within a single core cluster, or not). If this doesn't end in a crash, I'm wondering if there's some cache management issue that's masked by all of the platforms that do implement snooping.
(And the cache management issue is apparently unique to the ext4 code.)
s
sounds very suspect the way you describe it
theres often cma warnings too
unsure if theyre related
I may have just reset the wrong machine… let me re-run everything 😰
so i think it still crashed but i didnt see anything in the logs
im going to try again
a day later, i am positive it still crashes even on one core
@CFSworks 🙂
c
Definitely a kernel bug then. What's killing it, I have no idea. Seems like the first fault was in filesystem code though... the consistency of this happening in the filesystem doesn't seem like a coincidence.
s
Agreed
Maybe I should try 6.7
c
If you can find a version where it doesn't happen, you could do a git-bisect
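git-bisect can even drive the test automatically via `git bisect run`; here's a toy, self-contained demo in a throwaway repo where the "bug" appears at commit 4 (all names and the pass/fail script are made up for illustration):

```shell
#!/bin/sh
# Toy git-bisect demo: 5 commits, the "bug" appears at commit 4.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
for i in 1 2 3 4 5; do
    echo "$i" > n
    git add n
    git -c user.name=demo -c user.email=demo@example.com commit -qm "commit $i"
done
# HEAD (commit 5) is known bad, HEAD~4 (commit 1) is known good.
git bisect start HEAD HEAD~4 >/dev/null
# The run script exits 0 (good) while the bug is absent, nonzero (bad) once
# present; in the real case it would build the kernel and run the workload.
git bisect run sh -c 'test "$(cat n)" -lt 4' >/dev/null
git show -s --format=%s refs/bisect/bad   # prints the first bad commit's subject
```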
s
Problem is I can only go back so far for this board
6.6 was LTS yeah?
c
It was; you'll want to keep the .dtb from your latest build though, since the RK1 .dts landed in 6.7.
s
it might be related to lvm actually
c
Changing to a different fs but keeping lvm sounds like a good test
s
First I’m going to try in vanilla Debian w/o proxmox
so far so good on 8c with vanilla debian (no proxmox).
ok, crashed on vanilla debian, so it is indeed a kernel issue
alright ill build 6.7 now and try that
alright, openwrt compiling on 6.7. lets see if it crashes
crashed on 6.7 too
there has to be something else wrong, there's no way a basic ext4 system would be crashing for 3 minor kernel versions
@CFSworks sure it’s nothing in the device tree?
It really almost seems like a clock now
c
I haven't encountered anything on my end, but that doesn't mean it's definitely error-free. What makes it seem like a clock?
s
I read something on the collabora git about certain clocks being unstable
Let me see if I can find it
It wouldn’t be a missing kernel module right? Since it technically “works”
c
It doesn't make sense for it to be a missing kernel module, no.
s
I can’t find that clk reference
Maybe I should try building the vanilla Linux kernel?
And not from collabora
Maybe they introduced a buggy patch
Oh here’s one
That’s for rock-5b though
Ok lemme try vanilla kernel
Else, I’m out of ideas
c
Also run a test with a non-ext4 filesystem. If the error still happens with, I dunno, xfs, then we know it's not ext4-related.
s
ok still crashed on pure vanilla kernel 6.9-rc2 tag
guess i need to try btrfs or xfs
sounds like a tomorrow problem
i think i need to re-build u-boot with btrfs support
interesting data point
i tried to dd an image from /tmp on my nvme to mmcblk0 and it panicked
[ 1620.511576] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 1620.520153] CPU: 6 PID: 1 Comm: systemd Not tainted 6.9.0-rc2 #1
[ 1620.526868] Hardware name: Turing Machines RK1 (DT)
[ 1620.532315] Call trace:
[ 1620.535042]  dump_backtrace+0x94/0xec
[ 1620.539144]  show_stack+0x18/0x24
[ 1620.542848]  dump_stack_lvl+0x38/0x90
[ 1620.546944]  dump_stack+0x18/0x24
[ 1620.550648]  panic+0x39c/0x3d0
[ 1620.554054]  do_exit+0x834/0x92c
[ 1620.557659]  do_group_exit+0x34/0x90
[ 1620.561650]  copy_siginfo_to_user+0x0/0xc8
[ 1620.566227]  do_signal+0x118/0x1378
[ 1620.570126]  do_notify_resume+0xc8/0x140
[ 1620.574508]  el0_undef+0x84/0x98
[ 1620.578113]  el0t_64_sync_handler+0xa0/0x12c
[ 1620.582884]  el0t_64_sync+0x190/0x194
[ 1620.586974] SMP: stopping secondary CPUs
[ 1620.591452] Kernel Offset: disabled
[ 1620.595344] CPU features: 0x4,00000003,80140528,4200720b
[ 1620.601280] Memory Limit: none
[ 1620.604690] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
alright, were on btrfs. lets see
@CFSworks btrfs crashed
this was on eMMC and not the nvme (ext crashes were all on nvme) so its not the drives/storage either
as proof
/dev/mmcblk0p3 on / type btrfs (rw,relatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)
c
A btrfs crash means it's more likely to be hardware related, yeah. The only patch of mine that I can think related is already upstream, though. This is a tricky problem.
s
but its crashed on 3/4 of my devices
seems less likely hardware, no?
actually no, its crashed on all 4
one or two also on emmc
c
Oh oops I guess I meant "driver related"
s
what driver(s) are you thinking?
probably not related to pcie
not mmc
c
Could be eMMC, but I have no idea. Does this happen consistently whether you have NVMes installed or not?
s
I haven't tried removing nvme drives yet, but I have left 2 nodes on (running from nvme) with low IO and they've been up for 2-3 days straight now
I sent an email to collabora requesting a little assistance
s
Sorry for the delay in chiming in, ubuntu has kept me busy with the beta launch coming up. Do we have any theory on the issue?
s
Nope
I’ve tested just about everything I can test
Same behavior on all my nodes so it’s not hardware
Tried with one single core enabled
Tried on ext4 and btrfs
Tried collabora kernels 6.7, 6.8 and 6.9 as well as upstream 6.9-rc2
s
Hmm i have a 6.7 build, would you be able to see if it also crashes?
s
I did try on 6.7
s
Its not collabora tho
s
I’m willing to give it a shot though
s
Its more of an Armbian + my stuff kernel
s
Yeah drop it here I’ll try it
s
Oh geez its 3 months old
I keep getting pulled into the BSP kernel trap
c
What are the common elements of all of the crashes so far?
- On an RK1
- ~~Compiling OpenWrt~~ heavy filesystem access ~~to eMMC~~ to any external (network/block) target
- ext4 filesystem
- LVM
- Proxmox, not bare metal
- Multiple CPU cores enabled
s
Compiling openwrt is just one way to crash it. I’ve been able to crash when copying a large file from an NFS share to another
As well as dd’ing a file from /tmp to mmcblk0
s
I have about 20 different rk3588 boards btw, if i can get the exact reproducible steps i can rule out that its RK1 specific
s
Screw the BSP kernel lol
s
6.1 is not that bad, but 5.10 was CURSED
c
Can we eliminate the eMMC as the culprit, by doing heavy filesystem access to the NVMe instead?
s
@Spooky try to mount an NFS share or two and copy a large file between (like 20g+)
I’ve tested on both eMMC and NVME. Same result
c
Does NFS->NFS trigger it?
s
Yes
Here’s my build
c
How about mounting tmpfs and repeatedly doing
dd if=/dev/urandom of=/tmp/test.bin bs=4096 count=1024
in a tight loop?
s
Hmmm i need to power on my nas, it has an NFS server with a bunch of ubuntu build artifacts i can try transferring
c
Can you trigger the crash by running
iperf3
or other non-filesystem I/O?
s
let me try iperf
s
Off topic, but with Panthor being merged in 6.10 we will finally have gpu support. ill likely send a patch for HDMI, but the edid quirk may take some time to figure out a proper patch.
s
im stoked for this
will try this after
s
Ive been talking to some of the GPU / VPU devs, they are insanely smart. I believe AV1 support should be coming in soon.
s
also stoked for that, my whole media library is in av1
what exactly do you mean "mounting tmpfs" ?
s
Personally im waiting for 6.10 before doing any rebase on mainline. But im flashing a fresh 6.7 image now to test.
c
tmpfs is just an in-RAM filesystem.
/tmp
is typically one.
mount | grep /tmp
to check its type
(Sometimes it's not tmpfs, but a directory in
/
that gets cleared on boot)
s
udev on /dev type devtmpfs (rw,nosuid,relatime,size=16245064k,nr_inodes=4061266,mode=755)
tmpfs on /run type tmpfs (rw,nosuid,nodev,noexec,relatime,size=3254680k,mode=755)
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /run/lock type tmpfs (rw,nosuid,nodev,noexec,relatime,size=5120k)
ramfs on /run/credentials/systemd-tmpfiles-setup-dev.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
ramfs on /run/credentials/systemd-tmpfiles-setup.service type ramfs (ro,nosuid,nodev,noexec,relatime,mode=700)
tmpfs on /run/user/1001 type tmpfs (rw,nosuid,nodev,relatime,size=3254676k,nr_inodes=813669,mode=700,uid=1001,gid=1001)
i actually dont understand this one
$ df /tmp
Filesystem     1K-blocks    Used Available Use% Mounted on
/dev/nvme0n1p3  32937936 3895364  27348492  13% /
c
So /run and /dev/shm are tmpfs, but it looks like /tmp is on the NVMe
s
it is
it shouldn't be
c
Sometimes that's done just to give /tmp more space than just RAM can handle
s
i think 32g is enough lol
c
Well,
mount -t tmpfs /tmp /tmp
if you'd like to use RAM instead
s
iperf for 5 mins at 1gbit didnt crash it
lemme write to /tmp (in ram)
for i in $(seq 1 1024); do dd if=/dev/urandom of=/tmp/test.bin bs=4096 count=20480; done
seems to be okay so far
aha!
seems to be reading from disk
or reading in general
not writing
dd if=/dev/urandom of=/tmp/test.bin bs=4096 count=2048000 status=progress
then copy this somewhere else, e.g. to an nfs share
cp /tmp/test.bin /mnt/share/test.bin
crashed instantly
logs from this one
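Scaled down to its shape, the repro is just write-then-read-back; a parameterized sketch (paths and sizes here are placeholders, the real destination was an NFS mount and the file was ~8GB):

```shell
#!/bin/sh
# Write pseudo-random data to one file, copy it elsewhere, and verify the
# copy. Sizes are tiny here purely for illustration.
repro() {
    src=$1 dst=$2 blocks=$3
    dd if=/dev/urandom of="$src" bs=4096 count="$blocks" 2>/dev/null
    cp "$src" "$dst"
    cmp -s "$src" "$dst" && echo copy-ok
}
repro "$(mktemp)" "$(mktemp)" 16
```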
c
I wonder if you should test a lower patchlevel of each kernel minor you're testing
It occurs to me that there might be a bad fix commit that got cherrypicked onto each of the 6.6-8 branches
Maybe try out 6.7.1 first and go from there?
s
yep that is 100% it, just crashed again immediately
lemme check out 6.7.1
6.7.1 crashes
im trying @Spooky 's 6.7 ubuntu build now...
ok, @Spooky your ubuntu 22 build doesnt crash
so something is b0rked with my build 🙃
i installed @Spooky 's 6.7 kernel onto my debian 12 machine. still crashed when copying a file from an nfs share back to itself
[  564.470798] Unable to handle kernel paging request at virtual address 0068e851806854c4
[  564.479679] Mem abort info:
[  564.481027] Unable to handle kernel paging request at virtual address 0058db8f59c78b97
[  564.482787]   ESR = 0x0000000096000004
[  564.482790]   EC = 0x25: DABT (current EL), IL = 32 bits
[  564.482793]   SET = 0, FnV = 0
[  564.482794]   EA = 0, S1PTW = 0
[  564.482796]   FSC = 0x04: level 0 translation fault
[  564.482798] Data abort info:
[  564.482800]   ISV = 0, ISS= 0x00000004, ISS2 = 0x00000000
[  564.482802]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  564.482804]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  564.482806] [0068e851806854c4] address between user and kernel address ranges
[  564.482809] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  564.482812] Modules linked in: ip6table_filter ip6_tables iptable_filter bridge stp llc cfg80211 rfkill crct10dif_ce rk805_pwrkey hantro_vpu pwm_fan v4l2_vp9 v4l2_h264 v4l2_mem2mem videobuf2_dma_contig videobuf2_memops videobuf2_v4l2 rockchip_thermal videodev videobuf2_common mc fuse ip_tables x_tables ipv6 dm_thin_pool dm_persistent_data dm_bufio dm_bio_prison libcrc32c dm_mod dwmac_rk stmmac_platform stmmac rtc_hym8563 phy_rockchip_naneng_combphy pcs_xpcs rockchipdrm nvme analogix_dp dw_hdmi cec dw_hdmi_qp dw_mipi_dsi drm_display_helper drm_dma_helper drm_kms_helper nvme_core drm backlight
[  564.482878] CPU: 4 PID: 277 Comm: systemd-journal Not tainted 6.9.0-rc1-g99fc9cef1176-dirty #1
[  564.482882] Hardware name: Turing Machines RK1 (DT)
[  564.482884] pstate: 20400009 (nzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  564.482889] pc : __d_lookup_rcu+0x4c/0xf8
[  564.482901] lr : lookup_fast+0x34/0x144
[  564.482908] sp : ffff800084343a90
[  564.482909] x29: ffff800084343a90 x28: ffff800084343c80 x27: 0000000000000000
[  564.482914] x26: 2f2f2f2f2f2f2f2f x25: d0d0d0d0d0d0d0d0 x24: ffff800084343c80
[  564.482919] x23: fefefefefefefeff x22: ffff0001068a8026 x21: ffff00010777d000
[  564.482923] x20: ffff800084343c80 x19: ffff800084343c80 x18: 0000000000000000
[  564.482927] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[  564.482931] x14: 0000000000000000 x13: ffff0001068a8021 x12: ffff800084343cc4
[  564.482936] x11: 000000046a4d2343 x10: 000ffffffffffff8 x9 : 0000000000000004
[  564.482940] x8 : ffff00010777d000 x7 : e6e8e2ffb3bfa2a0 x6 : 0000000000210000
[  564.482945] x5 : ffff800081b72000 x4 : 00000000001a9348 x3 : ffff0007da200000
[  564.482949] x2 : 0000000000000004 x1 : 0268e851806854c8 x0 : 0268e851806854c8
[  564.482953] Call trace:
[  564.482955]  __d_lookup_rcu+0x4c/0xf8
[  564.482961]  walk_component+0x28/0x190
[  564.482964]  link_path_walk.part.0.constprop.0+0x294/0x394
[  564.482968]  path_openat+0xa8/0xef4
[  564.482971]  do_filp_open+0x9c/0x14c
[  564.482974]  do_sys_openat2+0xc0/0xf4
[  564.482979]  __arm64_sys_openat+0x64/0xa4
[  564.482983]  invoke_syscall+0x48/0x114
[  564.482989]  el0_svc_common.con
but it didnt crash on your ubuntu 22
c
Different set of loaded modules, perhaps?
s
oops, that was on 6.9.0
forgot to uninstall
this is the 6.7.0 crash
[  460.507171] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[  460.515737] CPU: 6 PID: 1 Comm: systemd Not tainted 6.7.0+ #1
[  460.522153] Hardware name: Turing Machines RK1 (DT)
[  460.527597] Call trace:
[  460.530321]  dump_backtrace+0x98/0x118
[  460.534506]  show_stack+0x18/0x24
[  460.538202]  dump_stack_lvl+0x74/0xc0
[  460.542289]  dump_stack+0x18/0x24
[  460.545984]  panic+0x3b4/0x3f0
[  460.549384]  do_exit+0x8cc/0x9b8
[  460.552984]  do_group_exit+0x34/0x90
[  460.556972]  get_signal+0x954/0x97c
[  460.560862]  do_notify_resume+0x298/0x1400
[  460.565432]  el0_da+0x8c/0x90
[  460.568733]  el0t_64_sync_handler+0xb8/0x12c
[  460.573489]  el0t_64_sync+0x1a4/0x1a8
[  460.577575] SMP: stopping secondary CPUs
[  460.582047] Kernel Offset: disabled
[  460.585936] CPU features: 0x1,80000000,70028146,2100720b
[  460.591865] Memory Limit: none
[  460.595271] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---
so the bootloaders differ, but i can't imagine that would be the cause... right?
c
Bootloader contains the DRAM initialization/training code, and a bug in that would affect RAM stability. Did you already try running
memtester
just to eliminate RAM problems as the culprit?
s
ooo
s
Nope. Can do
s
Ive been dealing with this stuff pretty much all day
s
Oh?
s
I have a new rk3588 board i can't name, but they upgraded the ram to ddr5. It requires the latest SPL and DDR blobs, it may be worth a shot to rebuild u-boot with the updated blobs, let me grab a link to em
Updating the blobs also helps ddr4 boards with stability, i forget where the changelogs are
s
well well well
i think it was that
yep
ahhh its finally stable 🥹
s
Awsome news!
I wish those blobs were open source, I'd love to investigate the ram training process in depth
s
Same
My build did use those new experimental open source ones
s
Ahhh that is probably why there were issues
s
im back 🙂
i made a change on all 4 of my rk1s, the change was enabling the gpu in the device tree and enabling the fan curve. pretty basic. on 3 of my 4 devices its all great, but one of them is having trouble coming back up https://cdn.discordapp.com/attachments/1223055516104659026/1256343847332745246/message.txt?ex=66806ce2&is=667f1b62&hm=ad510774f39e4a9bdfcb7c3cf22f0a2fd8c50326acd659a9df2d013fa70d1eb4&
any hints debugging this one?
heres the patch
here's an associated panic
c
Just to confirm: that fourth RK1 starts behaving again if you rollback the change, right?
Also on the 3 that boot happily, are these lines present in the log?
[  140.323832] thermal_sys: Failed to bind 'package-thermal' with 'pwm-fan': -17
[  143.054645] rockchip-pm-domain fd8d8000.power-management:power-controller: failed to set domain 'gpu', val=0
[  145.762181] rockchip-pm-domain fd8d8000.power-management:power-controller: failed to get ack on domain 'gpu', val=0x1bffff
Since the traceback tells me that the hardware(?!) is balking at a register update while trying to power "something" on (and I strongly believe that "something" is the GPU)
s
thermal_sys is present in the log on the other 3
the latter 2 lines are not
checking on rolling back the kernel, i think i have it somewhere
have to boot into emmc and chroot to reinstall/revert kernel
oh i have 6.9 still installed
back to 6.9 (without the fan and gpu enabled) boots as expected
i suppose that would be without the panthor driver too
c
The whole panic seems to happen from here to line 555:
rockchip_do_pmu_set_power_domain
is trying to tell the PMU to provide power to the GPU, and timing out while waiting for the PMU to confirm that it's powered (this is the
failed to set domain 'gpu', val=0
)
Execution continues to
rockchip_pmu_set_idle_request
to try to take the GPU out of "idle mode" (I guess this supplies it with clocks?) but the PMU never acknowledges that request either (
failed to get ack on domain 'gpu', val=0x1bffff
), probably because the GPU was never actually powered up
Finally it gets to
rockchip_pmu_restore_qos
which tries to set some registers on the GPU('s bus controller) itself, but the bus sends back an error because the GPU isn't powered, and when the bus error gets back to the CPU it appears as a panic
So the real mystery is why the PMU isn't cooperating when the CPU asks to power up the GPU
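One way to peek at power-domain state from userspace, assuming the kernel has generic power domains and debugfs is mounted (the genpd summary file is standard kernel debugfs, not RK1-specific, and needs root):

```shell
# Dump generic power-domain status (including the 'gpu' domain on RK3588);
# falls back gracefully where debugfs isn't available, e.g. in a container.
f=/sys/kernel/debug/pm_genpd/pm_genpd_summary
if [ -r "$f" ]; then
    grep -i gpu "$f" || cat "$f"
else
    echo "pm_genpd summary not readable (mount debugfs / run as root)"
fi
```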
s
FYI the kernel that it panicked on was 6.10-rc1
Given that it’s only one one of my 4, it sorta sounds like a hardware issue
This one has also been having issues bringing up one of my hard drives
c
It does. I know the GPU power is actually supplied off-chip (it's one of the voltages supplied by the RK805) so I'm wondering if the RK3588 is just waiting indefinitely for that power to show up.
s
on the other 3:
cat /sys/firmware/devicetree/base/gpu\@fb000000/status
okay
ls /dev/dri/
by-path  card0  renderD128
c
This is the bit in the PMU that it times out waiting for. I don't know what "repair" is but it might actually require that power is arriving and not merely "enabled." https://cdn.discordapp.com/attachments/1223055516104659026/1256385003957522452/image.png?ex=66809337&is=667f41b7&hm=3293a2426eb30488f0d579a46d5246609f2b08ebec2badd0ccb4129566014b62&
s
@DhanOS (Daniel Kukiela) might have an idea?
it wouldnt be a slot issue on the tpi2, right?
i can try swapping them later
but wait, this makes no sense
because it worked fine on my 6.10 kernel with the gpu enabled until i installed the one with the pwm change
i had set the gpu to be enabled without touching the fan in my previous build
c
So you had a build that worked on that node?
s
Yeah but I tried reverting to it and it failed to boot with the same errors
Hm maybe I didn’t upgrade that one
That one has the majority of my VMs and containers
c
If you're feeling hardcore, here's a picture of the RK1 with the cooler removed and pins 49-51 of the RK806 (PMIC) circled in red: https://cdn.discordapp.com/attachments/1223055516104659026/1256393464682123345/image.png?ex=66809b18&is=667f4998&hm=d21d0b1422aa0d86125beb1944ba54a97d4a401fcb8e376b56e50dbbe3e373ce&
So if you probe pin 49 with a voltmeter (please be careful not to short pins 49-50, though 49-48 might be fine if shorted by your probe) you should see it holding steady at 550-950mV when the GPU power is enabled. If not, that's pretty clear evidence of a hardware issue.
s
Oof this one will be hard given it has to be in the tpi2
Would this have to be before it panics?
c
It probably stays enabled even after it panics. Though before doing that, maybe check that beefy inductor (L10) near that corner, since that looks like it's the main inductor for the GPU power converter. I'd first check it visually and then with a continuity tester, just to make sure the wire isn't broken.
s
Will do
i havent had a chance to check that hardware yet, but removing the gpu enable from the device tree allows it to boot
update: it seems like a hardware issue now. i just rebooted it after operating normally since the previous message, and its panicing on boot. nothing changed.
[    4.472806] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
[    4.482717] Mem abort info:
[    4.485846]   ESR = 0x0000000096000004
[    4.490059]   EC = 0x25: DABT (current EL), IL = 32 bits
[    4.496022]   SET = 0, FnV = 0
[    4.499455]   EA = 0, S1PTW = 0
[    4.502976]   FSC = 0x04: level 0 translation fault
[    4.508451] Data abort info:
[    4.511688]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[    4.517843]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[    4.523514]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[    4.529476] [0000000000000008] user address but active_mm is swapper
[    4.536606] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[    4.543628] Modules linked in:
[    4.547054] CPU: 3 PID: 31 Comm: cpuhp/3 Not tainted 6.10.0-rc1+ #4
[    4.554076] Hardware name: Turing Machines RK1 (DT)
[    4.559529] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.567331] pc : blk_mq_hctx_notify_online+0x34/0xb0
[    4.572903] lr : cpuhp_invoke_callback+0x2c4/0x560
[    4.578278] sp : ffff80008249bd50
[    4.581990] x29: ffff80008249bd50 x28: ffff800081da9000 x27: 0000000000000000
[    4.589999] x26: 00000000000000ec x25: ffff0007fbf26c78 x24: 00000000000002f3
[    4.598006] x23: ffff8000807d07cc x22: ffff0007fbf26ca0 x21: ffff000101c03978
[    4.606012] x20: 0000000000000097 x19: ffff000101c03800 x18: ffff80008249bc78
[    4.614018] x17: 000000040044ffff x16: 005000f2b5503510 x15: 0000000000000000
[    4.622023] x14: ffff8000813b11a8 x13: ffffffffffffffff x12:000000034 x7 : ffff000100e8fe08 x6 : ffff000101aa5200
[    4.646021] x5 : ffff800081dad000 x4 : ffff0007fbf26ca0 x3 : 0000000000000003
[    4.654025] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000108496698
[    4.662031] Call trace:
[    4.66-
[   14.356782] platform a40000000.pcie: deferred probe pt: deferred probe pending: (reason unknown)
i just wrote a fresh image to emmc and it paniced. writing the same to another to see what happens
paniced? panicked? whatever lol
yeah booted fine. think this one is bricked
oooooooh now this is weird. i moved another rk1 into this one's slot and now it is panicing
ok i swapped this rk1 with another that was known working. the one previously panicking no longer panics on the other slot. the one previously working fine now panicks
@CFSworks could it be a slot issue?
before:
node1 - slot 1: good
node3 - slot 3: panic
after:
node1 - slot 3: panic
node3 - slot 1: good
ironically now slot2/node2 is panicking
c
Have you tried moving around whatever is in the M.2 slots? This might be the PCIe CLKREQ# issue manifesting yet again.
s
all m.2 slots have nvme drives. i did try emmc booting though, so wouldnt that negate that?
emmc still panicked
hmm actually i wonder
** File not found ubootefi.var **
Failed to load EFI variables
** Unable to write file ubootefi.var **
Failed to persist EFI variables
** Unable to write file ubootefi.var **
Failed to persist EFI variables
** Unable to write file ubootefi.var **
Failed to persist EFI variables
  0  efi_mgr      ready   (none)       0  <NULL>                   
** Booting bootflow '<NULL>' with efi_mgr
Loading Boot0000 'mmc 0' failed
EFI boot manager: Cannot load any image
Boot failed (err=-14)
pcie_dw_rockchip pcie@fe180000: PCIe-0 Link Fail
** Unable to write file ubootefi.var **
Failed to persist EFI variables
k#1.bootdev.part_3' with extlinux
on a side note, this was a very good test of my backup strategy lmao.