Bug 6559

Summary: wireguard causes CPU soft lockup
Product: Fedora EPEL
Reporter: Brian J. Murrell <brian>
Component: wireguard-kmod
Assignee: Nicolas Chauvet <kwizart>
Status: RESOLVED WONTFIX    
Severity: enhancement    
Priority: P1    
Version: 8   
Hardware: x86_64   
OS: GNU/Linux   
namespace:

Description Brian J. Murrell 2023-01-16 23:30:04 CET
Apologies for the incorrect Component.  It seems there are no matching Components for Wireguard here (yet?).

With kmod-wireguard-1.0.20220627-1.el8.x86_64, wireguard is causing CPU soft lockups.  One such instance is:

[  856.076833] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:0:3943]
[  856.084465] Modules linked in: wireguard(E) ip6_udp_tunnel udp_tunnel nft_counter xt_conntrack ipt_REJECT xt_comment xt_owner nft_compat nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink vfat fat intel_rapl_msr intel_rapl_common pcspkr joydev i2c_piix4 xfs libcrc32c nvme_tcp(X) nvme_fabrics nvme nvme_core ata_generic sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel ata_piix bochs drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm libata virtio_net ghash_clmulni_intel net_failover virtio_scsi failover serio_raw dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[  856.232782] CPU: 1 PID: 3943 Comm: kworker/1:0 Kdump: loaded Tainted: G            EL X --------- -  - 4.18.0-425.10.1.el8_7.x86_64 #1
[  856.298159] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
[  856.314897] Workqueue: events_power_efficient wg_ratelimiter_gc_entries [wireguard]
[  856.323200] RIP: 0010:native_safe_halt+0xe/0x20
[  856.377855] Code: 00 f0 80 48 02 20 48 8b 00 a8 08 75 c0 e9 79 ff ff ff 90 90 90 90 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d d6 96 41 00 fb f4 <e9> ed 09 21 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 e9 07 00 00
[  856.393980] RSP: 0018:ffffb36b401d7e10 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  856.401487] RAX: 0000000000000003 RBX: ffffffffc0986fb8 RCX: 0000000000000008
[  856.479314] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffffc0986fb8
[  856.483572] RBP: ffff8ac6bd32bcc0 R08: 0000000000000008 R09: 0000000000000054
[  856.490282] R10: 8080808080808080 R11: 0000000000000018 R12: 0000000000000000
[  856.499665] R13: 0000000000000001 R14: 0000000000000100 R15: 0000000000080000
[  856.510546] FS:  0000000000000000(0000) GS:ffff8ac6bd300000(0000) knlGS:0000000000000000
[  856.516042] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  856.577630] CR2: 000055b54cf9b000 CR3: 000000000c5fa000 CR4: 00000000003506e0
[  856.583036] Call Trace:
[  856.584837]  kvm_wait+0x58/0x60
[  856.586900]  __pv_queued_spin_lock_slowpath+0x268/0x2a0
[  856.590391]  _raw_spin_lock+0x1e/0x30
[  856.592722]  wg_ratelimiter_gc_entries+0x49/0x170 [wireguard]
[  856.596470]  process_one_work+0x1a7/0x360
[  856.598964]  ? create_worker+0x1a0/0x1a0
[  856.601373]  worker_thread+0x30/0x390
[  856.677856]  ? create_worker+0x1a0/0x1a0
[  856.681158]  kthread+0x10b/0x130
[  856.683423]  ? set_kthread_struct+0x50/0x50
[  856.686042]  ret_from_fork+0x35/0x40
[  884.076538] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:3943]
[  884.083439] Modules linked in: wireguard(E) ip6_udp_tunnel udp_tunnel nft_counter xt_conntrack ipt_REJECT xt_comment xt_owner nft_compat nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink vfat fat intel_rapl_msr intel_rapl_common pcspkr joydev i2c_piix4 xfs libcrc32c nvme_tcp(X) nvme_fabrics nvme nvme_core ata_generic sd_mod t10_pi sg crct10dif_pclmul crc32_pclmul crc32c_intel ata_piix bochs drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm libata virtio_net ghash_clmulni_intel net_failover virtio_scsi failover serio_raw dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[  884.292111] CPU: 1 PID: 3943 Comm: kworker/1:0 Kdump: loaded Tainted: G            EL X --------- -  - 4.18.0-425.10.1.el8_7.x86_64 #1
[  884.301646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
[  884.308227] Workqueue: events_power_efficient wg_ratelimiter_gc_entries [wireguard]
[  884.378841] RIP: 0010:__pv_queued_spin_lock_slowpath+0xe6/0x2a0
[  884.383342] Code: 14 41 bd 01 00 00 00 41 be 00 01 00 00 3c 02 0f 94 c0 0f b6 c0 48 89 04 24 c6 45 14 00 ba 00 80 00 00 c6 43 01 01 eb 0b f3 90 <83> ea 01 0f 84 5d 01 00 00 0f b6 03 84 c0 75 ee 44 89 f0 f0 66 44
[  884.395145] RSP: 0018:ffffb36b401d7e20 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
[  884.402593] RAX: 0000000000000003 RBX: ffffffffc0986fb8 RCX: 0000000000000008
[  884.476620] RDX: 000000000000145f RSI: 0000000000000003 RDI: ffffffffc0986fb8
[  884.481926] RBP: ffff8ac6bd32bcc0 R08: 0000000000000008 R09: 0000000000000054
[  884.487165] R10: 8080808080808080 R11: 0000000000000018 R12: 0000000000000000
[  884.493408] R13: 0000000000000001 R14: 0000000000000100 R15: 0000000000080000
[  884.500763] FS:  0000000000000000(0000) GS:ffff8ac6bd300000(0000) knlGS:0000000000000000
[  884.505802] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  884.510088] CR2: 000055b54cf9b000 CR3: 000000000c5fa000 CR4: 00000000003506e0
[  884.516203] Call Trace:
[  884.580390]  _raw_spin_lock+0x1e/0x30
[  884.583718]  wg_ratelimiter_gc_entries+0x49/0x170 [wireguard]
[  884.592481]  process_one_work+0x1a7/0x360
[  884.597203]  ? create_worker+0x1a0/0x1a0
[  884.601466]  worker_thread+0x30/0x390
[  884.604804]  ? create_worker+0x1a0/0x1a0
[  884.611012]  kthread+0x10b/0x130
[  884.617201]  ? set_kthread_struct+0x50/0x50
[  884.678700]  ret_from_fork+0x35/0x40

Not sure if this is the same as https://elrepo.org/bugs/view.php?id=1283 but I found that in my searches.

Any thoughts?
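
For triage, traces like the two above can be reduced to the CPU, the stuck duration and the first call-trace frame that belongs to a loadable module. The following is only a minimal sketch, not part of the original report: the function name `summarize_soft_lockups` and the regular expressions are assumptions based on the log lines pasted here, and other kernels may format these lines differently.

```python
import re

# Watchdog header, e.g.
# "[  856.076833] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:0:3943]"
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)

# Call-trace frame owned by a module, e.g.
# "[  856.592722]  wg_ratelimiter_gc_entries+0x49/0x170 [wireguard]"
FRAME_RE = re.compile(
    r"\]\s+(?:\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(?P<module>\w+)\]"
)


def summarize_soft_lockups(log_text):
    """Return one dict per soft-lockup report found in dmesg-style text.

    Hypothetical helper: it assumes the line layout seen in this report.
    """
    events = []
    current = None
    in_trace = False
    for line in log_text.splitlines():
        header = LOCKUP_RE.search(line)
        if header:
            # A new watchdog report starts here.
            current = {
                "timestamp": float(header.group("ts")),
                "cpu": int(header.group("cpu")),
                "stuck_seconds": int(header.group("secs")),
                "task": header.group("task"),
                "module_frame": None,
            }
            events.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        # Record only the first module-owned frame of each trace.
        if in_trace and current["module_frame"] is None:
            frame = FRAME_RE.search(line)
            if frame:
                current["module_frame"] = (frame.group("module"), frame.group("func"))
    return events
```

Fed the output above, this would report `("wireguard", "wg_ratelimiter_gc_entries")` as the module frame for both lockups, which matches the `_raw_spin_lock` caller visible in each trace.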
Comment 1 Nicolas Chauvet 2023-01-17 08:17:28 CET
Thanks for the report:

https://koji.rpmfusion.org/koji/taskinfo?taskID=580262

I've backported a fix from elrepo (not upstream unfortunately).

The package will be pushed to testing repos in the coming days...


Please report.
Comment 2 Nicolas Chauvet 2023-01-17 08:27:53 CET
component created.
Comment 3 Brian J. Murrell 2023-01-17 14:10:54 CET
Thanks for the update.  However, https://koji.rpmfusion.org/koji/taskinfo?taskID=580262 has the kernel module for kernel 4.18.0-425.3.1.el8.x86_64, whereas EL8.7 is currently on 4.18.0-425.10.1.el8_7.x86_64.

While I can appreciate that there is a[n a]kmod for it, if a binary kernel module is going to be produced, shouldn't it be for the current kernel?
Comment 4 Brian J. Murrell 2023-01-17 14:22:04 CET
Oh, wait.  I suppose kmod-wireguard-4.18.0-425.3.1.el8.x86_64-1.0.20220627-2.el8.x86_64 is a kABI module that will also load into the 4.18.0-425.10.1.el8_7.x86_64 kernel.

If so, then the update in https://koji.rpmfusion.org/koji/taskinfo?taskID=580262 does not fix the soft lockups.  So maybe this one is different from https://elrepo.org/bugs/view.php?id=1283?
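
The mismatch discussed in the last two comments (a kmod built against one kernel and loaded into another from the same series) can be spotted from the package name alone. This is a rough sketch under stated assumptions: the naming pattern `kmod-<name>-<kernel release>-<version>-<release>` is inferred only from the single package name quoted above, and `compare_kmod_to_kernel` and `kernel_series` are hypothetical helpers. Matching strings says nothing about whether the module actually behaves correctly on the running kernel.

```python
import re

# Assumed layout, inferred from the one package name in this thread:
#   kmod-<name>-<kernel release>-<module version>-<package release>
KMOD_RE = re.compile(
    r"^kmod-(?P<name>[a-z0-9_+]+)-"
    r"(?P<kernel>\d+\.\d+\.\d+-[^-]+)-"
    r"(?P<version>[^-]+)-(?P<release>[^-]+)$"
)


def kernel_series(kernel_release):
    """Return e.g. '4.18.0-425' from '4.18.0-425.10.1.el8_7.x86_64'."""
    version, _, rest = kernel_release.partition("-")
    return f"{version}-{rest.split('.', 1)[0]}"


def compare_kmod_to_kernel(package, running_kernel):
    """Compare the kernel a kmod package names with the running kernel.

    Hypothetical helper; 'same_series' is only a heuristic hint that the
    module may load via kABI, not a guarantee that it works.
    """
    match = KMOD_RE.match(package)
    if match is None:
        raise ValueError(f"unrecognised kmod package name: {package!r}")
    built_for = match.group("kernel")
    return {
        "built_for": built_for,
        "exact_match": built_for == running_kernel,
        "same_series": kernel_series(built_for) == kernel_series(running_kernel),
    }


result = compare_kmod_to_kernel(
    "kmod-wireguard-4.18.0-425.3.1.el8.x86_64-1.0.20220627-2.el8.x86_64",
    "4.18.0-425.10.1.el8_7.x86_64",
)
print(result)
```

On a live system the second argument could come from `platform.release()` instead of a literal string.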
Comment 5 Nicolas Chauvet 2023-01-17 17:16:13 CET
I've made another attempt:
https://koji.rpmfusion.org/koji/taskinfo?taskID=580272

Basically it's the same patch on the same kernel that elrepo has applied.

The alternative would be that the newer RHEL kernel has broken kABI for this particular module, in which case I will have to rebuild against a newer kernel (even though the minor version is the same).


If it still doesn't work, please see:
https://lists.zx2c4.com/pipermail/wireguard/2022-June/007664.html

In which case I would suggest considering
https://copr.fedorainfracloud.org/coprs/kwizart/kernel-longterm-5.15/
which has wireguard support...
(or moving to RHEL9)...


Still, your feedback is appreciated, as would be a patch if you can figure one out.
rfpkg clone free/wireguard-kmod
Comment 6 Brian J. Murrell 2023-01-17 17:50:58 CET
Unfortunately that new build is still causing soft lockups:

[12897.416072] wireguard: module verification failed: signature and/or required key missing - tainting kernel
[12897.424844] wireguard: WireGuard 1.0.20220627 loaded. See www.wireguard.com for information.
[12897.429423] wireguard: Copyright (C) 2015-2019 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
[12923.108393] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:4:60411]
[12923.113325] Modules linked in: wireguard(E) ip6_udp_tunnel udp_tunnel binfmt_misc ib_core mptcp_diag xsk_diag tcp_diag udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag tun nft_counter xt_conntrack ipt_REJECT xt_comment xt_owner nft_compat nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink vfat fat intel_rapl_msr intel_rapl_common joydev pcspkr i2c_piix4 xfs libcrc32c nvme_tcp(X) nvme_fabrics nvme nvme_core crct10dif_pclmul crc32_pclmul crc32c_intel bochs sd_mod drm_vram_helper drm_kms_helper t10_pi syscopyarea sg sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm drm ata_generic ghash_clmulni_intel ata_piix virtio_net libata serio_raw virtio_scsi net_failover failover dm_multipath sunrpc dm_mirror dm_region_hash dm_log dm_mod be2iscsi bnx2i cnic uio cxgb4i cxgb4 tls libcxgbi libcxgb
[12923.113396]  qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi
[12923.223678] Red Hat flags: eBPF/rawtrace
[12923.225862] CPU: 1 PID: 60411 Comm: kworker/1:4 Kdump: loaded Tainted: G            E  X --------- -  - 4.18.0-425.10.1.el8_7.x86_64 #1
[12923.232543] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
[12923.236979] Workqueue: events_power_efficient wg_ratelimiter_gc_entries [wireguard]
[12923.241227] RIP: 0010:native_safe_halt+0xe/0x20
[12923.243873] Code: 00 f0 80 48 02 20 48 8b 00 a8 08 75 c0 e9 79 ff ff ff 90 90 90 90 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d d6 96 41 00 fb f4 <e9> ed 09 21 00 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 e9 07 00 00
[12923.302473] RSP: 0018:ffffa60e45317e10 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[12923.306652] RAX: 0000000000000003 RBX: ffffffffc09b5fb8 RCX: 0000000000000008
[12923.310790] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffffffc09b5fb8
[12923.314931] RBP: ffff8f71bd32bcc0 R08: 0000000000000008 R09: 000000000000005c
[12923.320546] R10: 8080808080808080 R11: 0000000000000018 R12: 0000000000000000
[12923.325500] R13: 0000000000000001 R14: 0000000000000100 R15: 0000000000080000
[12923.329398] FS:  0000000000000000(0000) GS:ffff8f71bd300000(0000) knlGS:0000000000000000
[12923.334009] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12923.337117] CR2: 00005618fbfb2f44 CR3: 0000000018c10000 CR4: 00000000003506e0
[12923.340925] Call Trace:
[12923.393409]  kvm_wait+0x58/0x60
[12923.397646]  __pv_queued_spin_lock_slowpath+0x268/0x2a0
[12923.401519]  _raw_spin_lock+0x1e/0x30
[12923.403835]  wg_ratelimiter_gc_entries+0x49/0x170 [wireguard]
[12923.407317]  process_one_work+0x1a7/0x360
[12923.410226]  worker_thread+0x30/0x390
[12923.412572]  ? create_worker+0x1a0/0x1a0
[12923.415222]  kthread+0x10b/0x130
[12923.417588]  ? set_kthread_struct+0x50/0x50
[12923.420876]  ret_from_fork+0x35/0x40

What is strange is that this all worked for a few days (at least) before I rebooted the machine, at which point it started with this soft lockup business.
Comment 7 Nicolas Chauvet 2023-01-17 19:02:47 CET
Can you verify whether 4.18.0-425.3.1.el8.x86_64 also has the error?

Maybe you could give akmod-wireguard a try?
Comment 8 Brian J. Murrell 2023-01-17 21:38:38 CET
FWIW:

Name        : kmod-wireguard
Epoch       : 7
Version     : 1.0.20220627
Release     : 4.el8_7.elrepo
Architecture: x86_64
Install Date: Tue 17 Jan 2023 05:57:12 PM GMT
Group       : System Environment/Kernel
Size        : 363219
License     : GPLv2
Signature   : DSA/SHA256, Sat 14 Jan 2023 10:25:50 PM GMT, Key ID 309bc305baadae52
Source RPM  : kmod-wireguard-1.0.20220627-4.el8_7.elrepo.src.rpm
Build Date  : Sat 14 Jan 2023 10:23:28 PM GMT
Build Host  : Build64R8.elrepo.org
Relocations : (not relocatable)
Packager    : Philip J Perry <phil@elrepo.org>
Vendor      : The ELRepo Project (https://elrepo.org)
URL         : https://git.zx2c4.com/wireguard-linux-compat/

seems to be working.  I'd rather only configure rpmfusion TBH though.
Comment 9 Nicolas Chauvet 2023-01-17 22:17:37 CET
The latest wireguard package from elrepo is a rebuild against a newer kernel.

But the release bump wasn't published in git:
https://github.com/elrepo/packages/blob/master/wireguard-kmod/el8/wireguard-kmod.spec#L19

This breaks our assumption, as we expect to build for the GA kernel, not some more or less random kernel during a minor release.
That's for reproducibility across all our kmods.

Now I can probably drop that and build with a newer kernel...

But the issue will remain: wireguard on RHEL <9 will break from time to time...
Comment 10 Brian J. Murrell 2023-01-18 21:04:13 CET
EL Repo bug tracker references

https://access.redhat.com/solutions/6985596 and https://elrepo.org/bugs/view.php?id=1316 as the cause of this and specifically counsels rebuilding the kmod on 4.18.0-425.10.1.el8_7.
Comment 11 Nicolas Chauvet 2023-01-27 11:22:52 CET
OK,
here is a new attempt with the newer kernel as a base:
https://koji.rpmfusion.org/koji/buildinfo?buildID=24709
Comment 12 Nicolas Chauvet 2023-04-17 09:35:07 CEST
Can anyone confirm that the current package is usable on the current kernel?
Comment 13 Nicolas Chauvet 2023-11-29 09:51:15 CET
Upstream wireguard has stopped supporting the RHEL8 kernel, and RHEL9 has wireguard by default.
Also, this kmod fails to build from source, so I've removed it from the el8 repos.

Closing as WONTFIX.