Sunday, June 22, 2008

Issue on VMware server with RAID driver

I'm running CentOS 5.1 64-bit on a Dell PowerEdge 1900 with a Perc 5/i.

It's running VMware Server 1.0.5. It has been running fine for about 4 weeks. I got a call today that the server was not responding. On the console I got:

sd 0:2:0:0 rejecting I/O to offline device

along with an EXT3-fs error on sda3 (my root partition). The VMs run on the same RAID 5 logical drive, but in an LVM partition.

I could not log in via console or ssh. I rebooted, and the server seems to be running fine.

I found the following in TTY_000000.log:

06/21/08 11:50:06: *** PREFETCH ABORT exception: sf=a00fff18
06/21/08 11:50:06: CPSR=80000013: N=1,Z=0,C=0,V=0,Q=0 I=0 F=0 T=0 MODE=13(SVC)
06/21/08 11:50:06: r0=00000001 r1=0000004a r2=0000004b r3=fffffffd
06/21/08 11:50:06: r4=0000ffff r5=40040000 r6=00000000 r7=00000000
06/21/08 11:50:06: r8=a0e20320 r9=a0d48360 r10=000001bc r11=a00fff90
06/21/08 11:50:06: r12=00008d84 lr=a0b81de0 pc=a087de88
EVT#01762-06/21/08 11:50:06: 15=Fatal firmware error: Line 1014 in ../../raid/verdeMain.c

[0]: fp=a00ffea0, lr=a0898bbc - abort_prefetch+114
[1]: fp=a00ffef0, lr=a000abac - dbits+1788198
[2]: fp=a00fff14, lr=a000a74c - dbits+1787d38
[3]: fp=a00fff5c, lr=a087de88 - set_state+380
[4]: fp=a00fff90, lr=a087d9c8 - raid_task+318
[5]: fp=a00fffb8, lr=a08978b0 - main+3b0
[6]: fp=a00fffe4, lr=a0895fe0 - c_start+30
[7]: fp=a00ffffc, lr=9e8804cc - _start+6c
[8]: fp=a0018350, lr=a0006204 - dbits+17837f0
[9]: fp=a00183fc, lr=4a8 - 000004a8
MonTask: line 1014 in file ../../raid/verdeMain.c
INTCTL=16c00000:1003dcf, IINTSRC=0:0, FINTSRC=0:2000, CPSR=600000d3, sp=a00ffbe4
MegaMon> .....
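
The register dump above comes from the controller's own ARM processor, and the CPSR line already spells out the flags. As a cross-check, here is a minimal Python sketch that decodes a CPSR value using the standard ARM bit layout. The function name decode_cpsr and the mode table are my own for illustration; they are not part of any Dell or LSI tool.

```python
# Decode an ARM CPSR value as printed in the controller crash dump.
# Bit positions follow the standard ARM CPSR layout; decode_cpsr is a
# hypothetical helper name, not something shipped by Dell or LSI.
MODES = {
    0x10: "USR", 0x11: "FIQ", 0x12: "IRQ", 0x13: "SVC",
    0x17: "ABT", 0x1B: "UND", 0x1F: "SYS",
}

def decode_cpsr(value):
    """Return the condition flags, interrupt masks and mode of a CPSR value."""
    mode = value & 0x1F
    return {
        "N": (value >> 31) & 1,  # negative
        "Z": (value >> 30) & 1,  # zero
        "C": (value >> 29) & 1,  # carry
        "V": (value >> 28) & 1,  # overflow
        "Q": (value >> 27) & 1,  # sticky saturation
        "I": (value >> 7) & 1,   # IRQ disabled
        "F": (value >> 6) & 1,   # FIQ disabled
        "T": (value >> 5) & 1,   # Thumb state
        "MODE": MODES.get(mode, hex(mode)),
    }

if __name__ == "__main__":
    # Values taken from the log: the exception line and the MonTask line.
    for raw in ("80000013", "600000d3"):
        print(raw, decode_cpsr(int(raw, 16)))
```

Both CPSR values in the dump decode to supervisor (SVC) mode. The second one (600000d3) has the I and F bits set, meaning interrupts were masked by the time the monitor took over.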

Running the following:
Perc Firmware version 5.2.1-0067
Driver Version: 00.00.03.16

I also noticed that on every boot the log shows the following about the storage environment:

T0: LSI Logic MegaRAID firmware loaded
T0: Firmware version 1.03.40-0316 built on Sep 14 2007 at 03:16:20
T0: Board is type 1028/0015/1028/1f03

T0: Initializing 1MB memory pool
T0: EVT#01763-T0: 0=Firmware initialization started (PCI ID 0015/1028/1f03/1028)
T0: EVT#01764-T0: 1=Firmware version 1.03.40-0316
T0: Authenticating RAID key: Done!
T0: EepromInit: Family=33, SN=ed462f020000
T0: Waiting for Expansion ROM to load (ESC to bypass)...mfiIsr: idr=00000008
T48: done.....
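
Since the firmware log mixes register dumps with numbered EVT# entries, it can help to pull out just the events. The following is a rough Python sketch that does this with a regular expression. The pattern is inferred only from the three EVT# lines shown in this post, and I am assuming the number before the "=" is an event code; parse_events is a hypothetical helper, not an LSI utility.

```python
import re

# Pattern inferred from the EVT# lines in this post only:
#   EVT#<sequence>-<timestamp or T-tag>: <code>=<message>
# Treating the number before "=" as an event code is an assumption.
EVENT_RE = re.compile(r"EVT#(\d+)-(.+?): (\d+)=(.*)")

def parse_events(lines):
    """Return (sequence, stamp, code, message) tuples for each EVT# line."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match:
            seq, stamp, code, message = match.groups()
            events.append((int(seq), stamp, int(code), message.strip()))
    return events

# Lines copied from the logs above.
sample = [
    "06/21/08 11:50:06: r12=00008d84 lr=a0b81de0 pc=a087de88",
    "EVT#01762-06/21/08 11:50:06: 15=Fatal firmware error: Line 1014 in ../../raid/verdeMain.c",
    "T0: EVT#01763-T0: 0=Firmware initialization started (PCI ID 0015/1028/1f03/1028)",
    "T0: EVT#01764-T0: 1=Firmware version 1.03.40-0316",
]

if __name__ == "__main__":
    for event in parse_events(sample):
        print(event)
```

Run against a full TTY log, this gives a quick timeline of firmware events without the surrounding dump noise. The sequence numbers make it easy to see that the fatal error (01762) was immediately followed by a firmware re-initialization (01763).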

To me this looks like a problem with the storage driver, but I don't know what else to look at. As far as I can tell, all the drivers are up to date.

Update: I ended up scrapping this server for unrelated performance reasons. The error above was probably a driver issue, and I didn't want to deal with Dell to get a fix. I do remember getting an email months later from someone on the Dell Linux mailing list, and I believe there was a fix for it.
