Bug 748

Summary: 185.18.31 nvidia kmod dereferences a null pointer & crashes on launch
Product: Fedora Reporter: Joe Christy <joe>
Component: nvidia-kmod Assignee: Nicolas Chauvet <kwizart>
Status: RESOLVED FIXED    
Severity: normal CC: belegdol, fedora, jhoward, laurent.aguerreche, lt73, mvanross, s.adam
Priority: P5    
Version: 11   
Hardware: All   
OS: GNU/Linux   

Description Joe Christy 2009-08-05 22:56:22 CEST
I have a Lenovo Thinkpad W700 w/ 1GB Nvidia Quadro 3700 GPU, and Core 2 Extreme Duo CPU, running Fedora 11.

Everything worked fine with the 185.18.14 nvidia rpms from RPM Fusion Nonfree and the 2.6.29.6-213.fc11.x86_64 kernel, but when I upgraded to the 2.6.29.6-217.2.3.fc11.x86_64 kernel, I began having many problems with the Nvidia graphics, which refuse to run.

I am using kmod-nvidia-2.6.29.6-217.2.3.fc11.x86_64-185.18.31-1.fc11.x86_64.rpm, akmod-nvidia-185.18.31-1.fc11.x86_64.rpm, xorg-x11-drv-nvidia-185.18.31-1.fc11.x86_64.rpm, and xorg-x11-drv-nvidia-libs-185.18.31-1.fc11.x86_64.rpm. I find that the Xserver is always crashing on launch, throwing these kernel errors to syslog:

Aug  5 13:30:20 shango kernel: BUG: unable to handle kernel NULL pointer dereference at (null)
Aug  5 13:30:20 shango kernel: IP: [<(null)>] (null)
Aug  5 13:30:20 shango kernel: PGD 259050067 PUD 259049067 PMD 0 
Aug  5 13:30:20 shango kernel: Oops: 0010 [#1] SMP 
Aug  5 13:30:20 shango kernel: last sysfs file: /sys/devices/platform/dock.0/docked
Aug  5 13:30:20 shango kernel: CPU 1 
Aug  5 13:30:20 shango kernel: Modules linked in: rfcomm sco bridge stp llc bnep l2cap sunrpc coretemp ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 cpufreq_ondemand acpi_cpufreq freq_table dm_multipath uinput nvidia(P) snd_hda_codec_conexant arc4 ecb snd_hda_intel snd_hda_codec uvcvideo iwlagn snd_hwdep sdhci_pci videodev sdhci yenta_socket snd_pcm iwlcore mmc_core snd_timer firewire_ohci ricoh_mmc firewire_core snd e1000e ata_generic v4l1_compat soundcore i2c_i801 video wmi rsrc_nonstatic pata_acpi lib80211 btusb v4l2_compat_ioctl32 joydev mac80211 thinkpad_acpi bluetooth iTCO_wdt output iTCO_vendor_support cfg80211 snd_page_alloc i2c_core pcspkr crc_itu_t hwmon [last unloaded: microcode]
Aug  5 13:30:20 shango kernel: Pid: 2430, comm: X Tainted: P           2.6.29.6-217.2.3.fc11.x86_64 #1 2757CTO
Aug  5 13:30:20 shango kernel: RIP: 0010:[<0000000000000000>]  [<(null)>] (null)
Aug  5 13:30:20 shango kernel: RSP: 0018:ffff88025891fce0  EFLAGS: 00010292
Aug  5 13:30:20 shango kernel: RAX: ffff88024a266000 RBX: ffff8802540eafa8 RCX: 0000000000000001
Aug  5 13:30:20 shango kernel: RDX: ffff88024b501d54 RSI: ffff88024a266000 RDI: ffff880252962000
Aug  5 13:30:20 shango kernel: RBP: ffff8802540eaf60 R08: ffff880252962000 R09: ffff8802540eafe8
Aug  5 13:30:20 shango kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff88024b501d50
Aug  5 13:30:20 shango kernel: R13: ffff880252962000 R14: 00007fffc0df6290 R15: 0000000000000000
Aug  5 13:30:20 shango kernel: FS:  00007f3da3d357b0(0000) GS:ffff88025aaee500(0000) knlGS:0000000000000000
Aug  5 13:30:20 shango kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug  5 13:30:20 shango kernel: CR2: 0000000000000000 CR3: 000000025959f000 CR4: 00000000000006e0
Aug  5 13:30:20 shango kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug  5 13:30:20 shango kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Aug  5 13:30:20 shango kernel: Process X (pid: 2430, threadinfo ffff88025891e000, task ffff88024c17ae00)
Aug  5 13:30:20 shango kernel: Stack:
Aug  5 13:30:20 shango kernel: ffffffffa025b577 ffff8802540eafa8 00000000c1d00005 000000000000000c
Aug  5 13:30:20 shango kernel: 00007fffc0df6290 0000000000000110 ffffffffa025d2da 0000000000000020
Aug  5 13:30:20 shango kernel: ffffffffa06576d1 0000000000000110 ffffffffa06572d9 0000000000000110
Aug  5 13:30:20 shango kernel: Call Trace:
Aug  5 13:30:20 shango kernel: [<ffffffffa025b577>] ? _nv007272rm+0xaa/0x314 [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa025d2da>] ? _nv019651rm+0x16/0x1c [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa06576d1>] ? _nv003836rm+0x9/0xe [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa06572d9>] ? _nv003802rm+0x179/0x1c2 [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa0657616>] ? _nv003838rm+0x98/0x13e [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa058b23e>] ? _nv007070rm+0x19/0x25 [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa03990a6>] ? _nv003698rm+0x576/0x5b4 [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa0610181>] ? rm_ioctl+0x2f/0x67 [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa06e607f>] ? nv_kern_ioctl+0x307/0x36a [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffffa06e6128>] ? nv_kern_unlocked_ioctl+0x21/0x25 [nvidia]
Aug  5 13:30:20 shango kernel: [<ffffffff810e0ee7>] ? vfs_ioctl+0x22/0x87
Aug  5 13:30:20 shango kernel: [<ffffffff810e13cf>] ? do_vfs_ioctl+0x462/0x4a3
Aug  5 13:30:20 shango kernel: [<ffffffff810e1466>] ? sys_ioctl+0x56/0x79
Aug  5 13:30:20 shango kernel: [<ffffffff8101133a>] ? system_call_fastpath+0x16/0x1b
Aug  5 13:30:20 shango kernel: Code:  Bad RIP value.
Aug  5 13:30:20 shango kernel: RIP  [<(null)>] (null)
Aug  5 13:30:20 shango kernel: RSP <ffff88025891fce0>
Aug  5 13:30:20 shango kernel: CR2: 0000000000000000
Aug  5 13:30:20 shango kernel: ---[ end trace 27415fb262e3d769 ]---
Aug  5 13:30:28 shango kdm[2378]: X server startup timeout, terminating

Curiously, neither the Xorg.0.log nor the kdm.log shows any problems.
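
Since the X logs are silent, the oops has to be fished out of syslog. A minimal sketch for doing that, assuming the log lives at /var/log/messages and uses the line format quoted above; the function name oops_frames is made up for illustration:

```shell
# oops_frames MODULE [LOGFILE]
# Print the oops header line plus every call-trace frame that the kernel
# attributes to MODULE, e.g. "... rm_ioctl+0x2f/0x67 [nvidia]".
# The default log path is an assumption; adjust it for your system.
oops_frames() {
    module="$1"
    logfile="${2:-/var/log/messages}"
    grep -E "BUG: unable to handle|\+0x[0-9a-f]+/0x[0-9a-f]+ \[$module\][[:space:]]*\$" "$logfile"
}
```

Running `oops_frames nvidia` as root would then list only the header and the frames inside the nvidia module, which is the part worth attaching to a report.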

Before I upgraded the nvidia packages to the latest, I had installed akmod-nvidia-185.18.14-1.fc11.x86_64.rpm, etc. and for some reason, the akmod failed to build a new kernel module for the 2.6.29.6-217.2.3.fc11.x86_64 kernel.

I need to work now, so I plan to roll back to the older kernel and Nvidia packages. I would be willing, as time allows, to do more debugging in the near future.
Comment 1 Joe Christy 2009-08-06 02:04:59 CEST
Could this be a null pointer problem that has existed for a long time, but is just now exposed by this kernel change:

* Wed Jul 29 2009 Chuck Ebbert <cebbert@redhat.com> 2.6.29.6-217.2.3
- Don't optimize away NULL pointer tests where pointer is used before the test.
  (CVE-2009-1897)
Comment 2 Nicolas Chauvet 2009-08-06 09:05:13 CEST
You need to report the bug to nvidia using nvidia-bug-report.sh
Alternatively could you go back to the previous kernel using the new driver ?
Comment 3 Nicolas Chauvet 2009-08-06 11:26:41 CEST
Others seem to have reported that the 185.18.31 nvidia driver works.
It could be a problem specific to your hardware...

Can you confirm that:
- 185.18.31 works with the previous kernel?
- 185.18.18 can be built using the current Fedora kernel? (If not, what is the error log?)


You could have experienced a driver version mismatch because of the use of akmod-nvidia over kmod-nvidia for the standard Fedora kernel. Maybe you could check that the version of the nvidia kernel module matches that of the xorg driver.
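
One way to run that check, as a rough sketch: the function names are hypothetical, and it assumes the nvidia module exposes a "version" field to modinfo. If it does not, /proc/driver/nvidia/version (available once the module is loaded) reports the version instead.

```shell
# same_driver_version KMOD_VERSION XORG_VERSION
# Succeeds only if both strings are non-empty and identical.
same_driver_version() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# check_nvidia_versions: query the installed kernel module and the xorg
# driver package, then compare their versions.
# Assumption: modinfo can report a "version" field for the nvidia module.
check_nvidia_versions() {
    kmod_ver=$(modinfo -F version nvidia 2>/dev/null)
    xorg_ver=$(rpm -q --qf '%{VERSION}\n' xorg-x11-drv-nvidia 2>/dev/null)
    if same_driver_version "$kmod_ver" "$xorg_ver"; then
        echo "match: $kmod_ver"
    else
        echo "mismatch: kmod='$kmod_ver' xorg='$xorg_ver'"
        return 1
    fi
}
```

Calling `check_nvidia_versions` on the affected machine prints either the common version or both values side by side.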

Comment 4 Lee Trager 2009-08-07 10:42:47 CEST
I seem to be affected by this bug as well. When I boot into any kernel with the 185.18.31 NVIDIA driver, using either kmod or akmod, I get the same thing. The system loads up just fine, but when it starts X my screen turns off. I cannot change to another terminal to debug, which makes things a little difficult. I am, however, able to ssh in, so the machine is not dead. I get the same kernel oops as above, and my Xorg.log says no screens found, obviously because of the nvidia kernel driver oopsing.

It seems that this only affects Quadro users. I asked a friend who has a GeForce 8600 and everything works fine for them. Below is the PCI ID for my video card.

01:00.0 VGA compatible controller: nVidia Corporation Quadro FX 570M (rev a1)

The only way I was able to fix this was by disabling the NVIDIA driver. I did this by removing the NVIDIA process from startup and running nvidia-config-display disable. I then get the vesa driver which isn't what I want but at least I have X.

A quick search on nvnews.net shows that upgrading to the latest NVIDIA beta driver fixes this issue. I haven't tried that yet.

http://www.nvnews.net/vbulletin/showthread.php?t=136959
Comment 5 Nicolas Chauvet 2009-08-07 11:00:44 CEST
I will re-introduce either beta and/or newest series for quadro users until nvidia update the stable drivers. 

Comment 6 Jim 2009-08-07 16:21:16 CEST
(In reply to comment #5)
> I will re-introduce either beta and/or newest series for quadro users until
> nvidia update the stable drivers. 
> 

01:00.0 VGA compatible controller: nVidia Corporation Quadro FX 360M (rev a1)

I am having the same issue.  Glad to see a fix is coming.
Comment 7 Yusuf Altin 2009-08-07 16:40:42 CEST
I have the same problem on different hardware.

I am using a Sony Vaio FE41M notebook, 32-bit F11.

My graphics card is

01:00.0 VGA compatible controller: nVidia Corporation G72M [GeForce Go 7400] (rev a1)

The problem exists with the previous kernel, too.

After downgrade to Nvidia 185.18.14 everything is ok.
Comment 8 Joe Christy 2009-08-07 17:27:29 CEST
To confirm what others have already found, I systematically:

0) downloaded the 185.18.14 and 185.18.31 nvidia-kmod SRPMs, and built both without detecting  any issues using "akmodsbuild -v -v -v -v".

then:

1) yum remove \*nvidia\*
2) yum localinstall akmod-nvidia-<N.N.N-M> xorg-x11-drv-nvidia-<N.N.N-N> xorg-x11-drv-nvidia-libs-<N.N.N-N>
3) reboot

for N.N.N-M,N.N.N-N = 185.18.14-1,185.18.14-3 and 185.18.31-1,185.18.31-1 respectively, under kernels 2.6.29.6-213.fc11.x86_64 and 2.6.29.6-217.2.3.fc11.x86_64
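
The four combinations described above can be enumerated with a dry-run sketch like the following. It only prints each kernel and package set and installs nothing; the function name test_matrix is invented for illustration, while the version strings are the ones listed above.

```shell
# test_matrix: print one line per (kernel, driver package set) combination.
# Each combo is "akmod release,xorg release", split on the comma.
test_matrix() {
    for kernel in 2.6.29.6-213.fc11.x86_64 2.6.29.6-217.2.3.fc11.x86_64; do
        for combo in 185.18.14-1,185.18.14-3 185.18.31-1,185.18.31-1; do
            kmod_rel=${combo%,*}   # part before the comma
            xorg_rel=${combo#*,}   # part after the comma
            echo "kernel=$kernel akmod-nvidia-$kmod_rel xorg-x11-drv-nvidia-$xorg_rel xorg-x11-drv-nvidia-libs-$xorg_rel"
        done
    done
}

test_matrix
```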

I found that the 185.18.14 nvidia packages worked with both kernels and the 185.18.31 packages failed in identical fashions, consistent with all prior comments, with both kernels.

Unfortunately, just as I was finishing the tests, but before reporting on the results, my family sent out a search party to bring me back.

Thanks, Nicolas, for offering to package the latest nvidia drivers; I look forward eagerly to trying the 190.xx packages when you do. Until then I'm happy using the 185.18.14 drivers.

Is there any point now to submitting a bug report directly to Nvidia, as they seem to be aware of, and have fixed, the problem?
Comment 9 Nicolas Chauvet 2009-08-07 19:37:08 CEST
OK, so here is the chosen solution:
I have reverted kmod-nvidia to 185.18.14 and built it for the latest (current) Fedora kernel. The 185.18.31 driver will be moved back to the rpmfusion-nonfree-updates-testing repository.
If you have a problem with the latest driver, you will have to remove it with:
yum remove xorg-x11-drv-nvidia
then reinstall (that will be 185.18.14 in a few hours) using
yum install kmod-nvidia(-PAE)

Those who did not experience problems with the latest driver will stay with 185.18.31.

For the next Fedora kernel, the kmod will be built for 185.18.31 or whatever the next nvidia driver is (from the stable series).

Hopefully early next week, I will re-introduce the nvidia-beta driver series with a new packaging scheme. That will be versioned as 190.19


>Is there any point now to submitting a bug report directly to Nvidia, as they
>seem to be aware of, and have fixed, the problem?
You can learn how to report a problem to nvidia here:
http://www.nvnews.net/vbulletin/showthread.php?t=46678

Comment 10 Joe Christy 2009-08-17 17:23:01 CEST
Nicolas

WRT the chosen solution in Comment #9, are there any projections about *which* next week we will see the nvidia-beta driver series in ;)?

I'm fine w/ the akmod rolling new kmods w/ the newer kernels; I'm just curious.
Comment 11 Nicolas Chauvet 2009-08-17 17:26:09 CEST
(In reply to comment #10)
> Nicolas
> 
> WRT the chosen solution in Comment #9, are there any projections about *which*
> next week we will see the nvidia-beta driver series in ;)?
I will be on holiday without internet access from tomorrow, for two weeks.
Today is too short to receive the first feedback, even if the new packaging scheme were ready.

For now, you will have to update the rpm on your own (not many things to change).

Comment 12 Mark van Rossum 2009-08-29 13:47:50 CEST
See also 

https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-180/+bug/417373


Basically 
185.18.31 is bad for some machines
185.18.36 is ok

Comment 13 Nicolas Chauvet 2009-09-01 15:10:17 CEST
You can test 185.18.36, which has been moved to rpmfusion-nonfree-updates-testing, either by testing a 2.6.30 kernel that is in the Fedora updates-testing repository, or by using a module built for 2.6.29 from the current updates-stable.

I will close the bug once 185.18.36 hits stable.
Comment 14 Nicolas Chauvet 2009-09-07 22:13:35 CEST
185.18.36 has been pushed to stable; this bug should be fixed.