Bug 3602

Summary: Kernel errors & Gnome locks up with nvidia-340xx
Product: Fedora
Reporter: Russ Odom <russ+bugzilla-rf>
Component: nvidia-340xx-kmod
Assignee: Nicolas Chauvet <kwizart>
Status: RESOLVED WONTFIX
Severity: normal
CC: kwizart, leigh123linux, pprzemal, russ+bz-rf
Priority: P5
Version: 23
Hardware: All
OS: GNU/Linux
Attachments: nvidia-bug-report.sh output (attachment 1428)
nvidia-bug-report.sh output (attachment 1429)

Description Russ Odom 2015-04-21 22:44:27 CEST
I've just upgraded from Fedora 19 to 21 and am using the nvidia-340xx drivers. After a few minutes (the exact time varies, but it is never long), Gnome locks up with a flickering/rolling corrupted display. The machine remains somewhat responsive over SSH for a while and other processes on the box keep running, but it gradually becomes less usable and eventually requires a hard reset.

I have tried both without an xorg.conf and with one created by nvidia-xconfig, with the same result.

[root@gigalith ~]# lspci | grep VGA
02:00.0 VGA compatible controller: NVIDIA Corporation C77 [GeForce 8300] (rev a2)
[root@gigalith ~]# rpm -qa | grep nvidia
kmod-nvidia-340xx-3.19.3-200.fc21.x86_64-340.76-2.fc21.5.x86_64
xorg-x11-drv-nvidia-340xx-340.76-1.fc21.x86_64
xorg-x11-drv-nvidia-340xx-libs-340.76-1.fc21.x86_64
akmod-nvidia-340xx-340.76-2.fc21.x86_64
xorg-x11-drv-nvidia-340xx-kmodsrc-340.76-1.fc21.x86_64
[root@gigalith ~]# uname -a
Linux gigalith.example.com 3.19.3-200.fc21.x86_64 #1 SMP Thu Mar 26 21:39:42 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

At the point it fails, this appears in the log:
Apr 21 20:41:19 gigalith.example.com kernel: NVRM: GPU at PCI:0000:02:00: GPU-63b85a5e-da4e-d401-32f7-5b972a9d7c60
Apr 21 20:41:19 gigalith.example.com kernel: NVRM: Xid (PCI:0000:02:00): 26, Ch 00000001 M 00000324 D 00000101 intr 00400000
Apr 21 20:41:19 gigalith.example.com kernel: NVRM: Xid (PCI:0000:02:00): 26, Ch 00000001 M 000012e8 D 00000001 intr 04c00000

[...time passes... (irrelevant lines snipped)]

Apr 21 20:42:19 gigalith.example.com kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 1} (detected by 2, t=60004 jiffies, g=35176, c=35175, q=0)
Apr 21 20:42:19 gigalith.example.com kernel: Task dump for CPU 1:
Apr 21 20:42:19 gigalith.example.com kernel: Xorg.bin        R  running task        0  4134   4128 0x0040000c
Apr 21 20:42:19 gigalith.example.com kernel:  ffffffff8177008a ffff8800cb01c3d0 0000000000014580 ffff8800d4c1bfd8
Apr 21 20:42:19 gigalith.example.com kernel:  0000000000014580 ffff8800d35c9360 ffff8800cb01c3d0 0000000038f089a3
Apr 21 20:42:19 gigalith.example.com kernel:  0000000000000000 00000000000cbf6e 0031acfd009ede07 00007ffcf842432c
Apr 21 20:42:19 gigalith.example.com kernel: Call Trace:
Apr 21 20:42:19 gigalith.example.com kernel:  [<ffffffff8177008a>] ? __schedule+0x2ca/0x850
Apr 21 20:42:19 gigalith.example.com kernel:  [<ffffffff81770aa5>] ? schedule_user+0x35/0xb0

[...more time passes...]

Apr 21 20:42:40 gigalith.example.com kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [Xorg.bin:4134]
Apr 21 20:42:40 gigalith.example.com kernel: Modules linked in: bnep bluetooth rfkill fuse rc_dib0700_rc5 xt_TARPIT(OE) dib7000p gspca_sn9c20x gspca_main dvb_usb_dib0700 dib7000m dib0090 dib0070 dib3000mc dibx000_common dvb_usb videodev dvb_core media rc_core usblp binfmt_misc nf_log_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common xt_LOG nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter xt_conntrack nf_conntrack ip6_tables snd_hda_codec_hdmi linear snd_hda_codec_via snd_hda_codec_generic snd_hda_intel snd_hda_controller kvm_amd nvidia(POE) snd_hda_codec kvm snd_hwdep snd_seq snd_seq_device snd_pcm drm edac_core snd_timer serio_raw k10temp snd edac_mce_amd soundcore asus_atk0110 shpchp i2c_nforce2 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc raid1 forcedeth ata_generic pata_acpi
Apr 21 20:42:40 gigalith.example.com kernel:  video wmi uas usb_storage
Apr 21 20:42:40 gigalith.example.com kernel: CPU: 1 PID: 4134 Comm: Xorg.bin Tainted: P           OEL 3.19.3-200.fc21.x86_64 #1
Apr 21 20:42:40 gigalith.example.com kernel: Hardware name: System manufacturer System Product Name/M4N78 PRO, BIOS 1101    09/02/2009
Apr 21 20:42:40 gigalith.example.com kernel: task: ffff8800cb01c3d0 ti: ffff8800d4c18000 task.ti: ffff8800d4c18000
Apr 21 20:42:40 gigalith.example.com kernel: RIP: 0010:[<ffffffffa09f0d63>]  [<ffffffffa09f0d63>] _nv013411rm+0x13/0x40 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel: RSP: 0000:ffff88011fc83c48  EFLAGS: 00003216
Apr 21 20:42:40 gigalith.example.com kernel: RAX: 000000000001c000 RBX: ffffffffa09a9d9a RCX: 000000000001c000
Apr 21 20:42:40 gigalith.example.com kernel: RDX: ffffc90012200000 RSI: ffff88011a040008 RDI: ffff8800d5211008
Apr 21 20:42:40 gigalith.example.com kernel: RBP: ffff88011a102a40 R08: 0000000000000020 R09: ffff88011a102a54
Apr 21 20:42:40 gigalith.example.com kernel: R10: ffff8800cc445008 R11: ffffffffa07820a0 R12: ffff88011fc83bb8
Apr 21 20:42:40 gigalith.example.com kernel: R13: ffffffff817758fd R14: ffff88011a102a40 R15: 0000000000000002
Apr 21 20:42:40 gigalith.example.com kernel: FS:  00007f82c98e39c0(0000) GS:ffff88011fc80000(0000) knlGS:00000000efe547c0
Apr 21 20:42:40 gigalith.example.com kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 21 20:42:40 gigalith.example.com kernel: CR2: 00007f74212d7640 CR3: 00000000ca99d000 CR4: 00000000000007e0
Apr 21 20:42:40 gigalith.example.com kernel: Stack:
Apr 21 20:42:40 gigalith.example.com kernel:  ffff88011a040008 ffffffffa0782196 ffff88011a040008 0000000000000000
Apr 21 20:42:40 gigalith.example.com kernel:  0000000000000000 ffff88011a102a58 ffffffff00000000 ffffffffa04ec9ee
Apr 21 20:42:40 gigalith.example.com kernel:  ffff88011a040008 ffff88011a040008 ffff8800cc445008 0000000000000388
Apr 21 20:42:40 gigalith.example.com kernel: Call Trace:
Apr 21 20:42:40 gigalith.example.com kernel:  <IRQ> 
Apr 21 20:42:40 gigalith.example.com kernel: 
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa0782196>] ? _nv008900rm+0x1896/0x5c10 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa04ec9ee>] ? _nv001242rm+0xae/0xe0 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa04ecb78>] ? _nv002033rm+0x158/0x1e0 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa04e7b5a>] ? _nv002271rm+0x3aa/0x710 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa04ee151>] ? _nv002152rm+0x601/0xba0 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa04eea7e>] ? _nv002174rm+0x38e/0x530 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa0748149>] ? _nv008018rm+0xa9/0x1f0 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa07417d1>] ? _nv007859rm+0x2a1/0x860 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa0741e58>] ? _nv007868rm+0xc8/0xa00 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa082a7cb>] ? _nv012122rm+0x1bb/0x800 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa0821087>] ? _nv012103rm+0x47/0x70 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa09f2b65>] ? _nv000791rm+0x105/0x140 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa09f7913>] ? rm_isr_bh+0x23/0x70 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffffa0a0506b>] ? nvidia_isr_bh+0x3b/0x60 [nvidia]
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffff8109fd86>] ? tasklet_action+0xe6/0xf0
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffff810a01ab>] ? __do_softirq+0x10b/0x2b0
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffff810a0565>] ? irq_exit+0x125/0x130
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffff81777788>] ? do_IRQ+0x58/0xf0
Apr 21 20:42:40 gigalith.example.com kernel:  [<ffffffff8177556d>] ? common_interrupt+0x6d/0x6d
Apr 21 20:42:40 gigalith.example.com kernel:  <EOI> 
Apr 21 20:42:40 gigalith.example.com kernel: Code: 
Apr 21 20:42:40 gigalith.example.com kernel: ff e8 82 b3 fc ff 0f b7 c3 5b c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 22 12 00 00 be 01 00 00 00 48 89 c2 31 ff 
[...this repeats periodically with a similar trace]

Suggestions appreciated! Let me know if I can provide more info.
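
Since the soft-lockup trace repeats periodically, it can help to tally which task and CPU the watchdog keeps naming. A minimal sketch, assuming the log has been saved as plain text lines; the function name `summarize_soft_lockups` is hypothetical and the regular expression is derived only from the watchdog line quoted above, so other kernels may format it differently:

```python
import re
from collections import Counter

# Pattern derived from the "NMI watchdog: BUG: soft lockup" line in this report.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Count soft-lockup reports per (process name, PID, CPU)."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            key = (match["comm"], int(match["pid"]), int(match["cpu"]))
            counts[key] += 1
    return counts


# Two lines copied from the log above; only the first is a watchdog report.
sample = [
    "Apr 21 20:42:40 gigalith.example.com kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [Xorg.bin:4134]",
    "Apr 21 20:42:40 gigalith.example.com kernel: CPU: 1 PID: 4134 Comm: Xorg.bin Tainted: P           OEL 3.19.3-200.fc21.x86_64 #1",
]

for (comm, pid, cpu), count in summarize_soft_lockups(sample).items():
    print(f"{comm} (pid {pid}) on CPU {cpu}: {count} report(s)")
```

On a full journal dump the same function would show whether every report points at Xorg.bin on the same CPU or whether other tasks get stuck as well.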
Comment 1 Nicolas Chauvet 2015-04-21 23:38:42 CEST
Please attach the output of nvidia-bug-report.sh.
Also consider forwarding the issue to NVIDIA.
Comment 2 Russ Odom 2015-04-22 10:11:10 CEST
Created attachment 1428 [details]
nvidia-bug-report.sh output
Comment 3 Russ Odom 2015-04-22 10:23:52 CEST
After uploading the above attachment, I discovered the instructions at https://devtalk.nvidia.com/default/topic/522835/linux/if-you-have-a-problem-please-read-this-first/ - so the attachment might not contain all the info needed. I will retry when time permits (it might be a few days).
Comment 4 Przemysław Palacz 2015-04-22 15:36:18 CEST
Do not forget to install the xorg-x11-drv-nvidia-340xx-cuda package so that a full bug report is generated before you send it to Nvidia.
Comment 5 Russ Odom 2015-04-28 22:12:36 CEST
Created attachment 1429 [details]
nvidia-bug-report.sh output

This time with Gnome running. Note that I've upgraded the kernel since the last run; I'm now on kernel-3.19.5-200.fc21.x86_64.
Comment 6 Nicolas Chauvet 2015-04-28 22:25:15 CEST
Here is where the GPU locks up:

Apr 21 20:19:22 gigalith.example.com kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  340.76  Thu Jan 22 12:11:08 PST 2015
Apr 21 20:41:19 gigalith.example.com kernel: NVRM: GPU at PCI:0000:02:00: GPU-63b85a5e-da4e-d401-32f7-5b972a9d7c60
Apr 21 20:41:19 gigalith.example.com kernel: NVRM: Xid (PCI:0000:02:00): 26, Ch 00000001 M 00000324 D 00000101 intr 00400000
Apr 21 20:41:19 gigalith.example.com kernel: NVRM: Xid (PCI:0000:02:00): 26, Ch 00000001 M 000012e8 D 00000001 intr 04c00000

Everything is in good shape with the driver package, so please report the issue to NVIDIA and keep this bug updated.
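
For anyone collecting these events before reporting upstream, the Xid lines can be pulled out of a saved journal with a short script. This is only a sketch: the `XidEvent` type and `parse_xid_events` function are hypothetical names, the pattern is derived solely from the NVRM lines quoted above, and the meaning of each Xid code is left to NVIDIA's own documentation rather than decoded here.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class XidEvent(NamedTuple):
    pci: str     # PCI address as printed by the driver, e.g. "PCI:0000:02:00"
    xid: int     # numeric Xid code
    detail: str  # remainder of the line, kept verbatim


# Pattern derived from the "NVRM: Xid (...)" lines in this report.
XID_LINE = re.compile(
    r"NVRM: Xid \((?P<pci>PCI:[0-9A-Fa-f:.]+)\): (?P<xid>\d+), (?P<detail>.*)"
)


def parse_xid_events(lines: Iterable[str]) -> Iterator[XidEvent]:
    """Yield one XidEvent per NVRM Xid line, skipping everything else."""
    for line in lines:
        match = XID_LINE.search(line)
        if match:
            yield XidEvent(match["pci"], int(match["xid"]), match["detail"].strip())


# Lines copied from the log excerpt above.
sample = [
    "Apr 21 20:41:19 gigalith.example.com kernel: NVRM: GPU at PCI:0000:02:00: GPU-63b85a5e-da4e-d401-32f7-5b972a9d7c60",
    "Apr 21 20:41:19 gigalith.example.com kernel: NVRM: Xid (PCI:0000:02:00): 26, Ch 00000001 M 00000324 D 00000101 intr 00400000",
    "Apr 21 20:41:19 gigalith.example.com kernel: NVRM: Xid (PCI:0000:02:00): 26, Ch 00000001 M 000012e8 D 00000001 intr 04c00000",
]

events = list(parse_xid_events(sample))
for event in events:
    print(event.pci, event.xid, event.detail)
```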
Comment 8 Nicolas Chauvet 2015-05-10 18:41:45 CEST
I've reproduced the issue with a 3.19.x kernel (my earlier issue with 3.19.x was btrfs-related).
I managed to boot using:
  rd.driver.blacklist=nvidia
and by removing:
  rhgb quiet nouveau.modeset=0 rd.driver.blacklist=nouveau video=vesa:off

I'm still using gfxpayload=text.

I still need to test several combinations of boot options to find out which option makes the boot fail or succeed.
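
The combinations to test can be listed mechanically. A minimal sketch, assuming the working command line is the single option above and the candidates are the five removed options; the names `WORKING`, `REMOVED` and `candidate_cmdlines` are hypothetical, and the script only prints argument strings without touching the bootloader:

```python
from itertools import combinations

# Option that produced a successful boot, per the comment above.
WORKING = ["rd.driver.blacklist=nvidia"]

# Options that were removed to get that successful boot.
REMOVED = [
    "rhgb",
    "quiet",
    "nouveau.modeset=0",
    "rd.driver.blacklist=nouveau",
    "video=vesa:off",
]


def candidate_cmdlines(working, removed):
    """Yield argument strings that re-add 0, 1, 2, ... of the removed options."""
    for count in range(len(removed) + 1):
        for subset in combinations(removed, count):
            yield " ".join([*working, *subset])


cmdlines = list(candidate_cmdlines(WORKING, REMOVED))
print(f"{len(cmdlines)} candidate command lines")
for cmdline in cmdlines[:6]:
    print(cmdline)
```

Ordering by the number of re-added options means the single-option cases come right after the known-good line, so if one option alone breaks the boot it shows up within the first few tests.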
Comment 9 Nicolas Chauvet 2015-05-10 18:43:07 CEST
@Przemysław
Are you able to reproduce this with a 3.19 kernel?
Comment 10 Nicolas Chauvet 2015-05-10 18:43:36 CEST
BTW, I'm using:
02:00.0 VGA compatible controller: NVIDIA Corporation G96GL [Quadro FX 580] (rev a1)
Comment 11 Przemysław Palacz 2015-05-10 23:00:43 CEST
I'm using KDE Plasma 5 on a 9600gt and I don't think I have this exact issue: there are no traces in the journal or dmesg.

However, I've been experiencing some freezes under KDE too, but no kernel/Xorg crashes, yet...
Actually, I had quite a lot of freezes, but most of them stopped after I switched the window decoration from breeze to oxygen.

The other one was actually caused by the Nvidia driver, but KWIN_EXPLICIT_SYNC=0 worked around that; more info at https://bugs.kde.org/show_bug.cgi?id=346116

I'm thinking about trying out the latest Gnome, so I'll report back with the results when I finally get around to installing F22...
Comment 12 Emmanuel Seyman 2015-12-04 13:48:30 CET
RPMFusion is no longer releasing updates for this version of Fedora. This bug
will be set to RESOLVED:EXPIRED next week to reflect this.

If the problem persists after upgrading to the latest version of Fedora, please
update the version field of this bug (and re-open it if it has been closed).
Comment 13 Emmanuel Seyman 2015-12-13 11:12:34 CET
Closing this bug with the EXPIRED resolution since Fedora no longer ships updates for this version of Fedora.

Please set the Version field to a supported version of Fedora if you re-open this bug.
Comment 14 Russell Odom 2015-12-16 21:27:55 CET
I have replicated this issue on Fedora 23 with the 340xx drivers. I've switched back to the 304xx ones for now (I have another issue with those, but at least they don't cause my system to hang).
Comment 15 leigh scott 2016-07-13 09:08:45 CEST
I doubt nvidia will ever fix this issue in the legacy driver, so there is little point in leaving this bug open.