Sunday, June 22, 2008

Issue on VMware server with RAID driver

I'm running CentOS 5.1 64-bit on a Dell PowerEdge 1900 with a Perc 5/i.

It's running VMware Server 1.0.5 and has been running fine for about four weeks. I got a call today that the server was not responding. On the console I saw:

sd 0:2:0:0 rejecting I/O to offline device

followed by an EXT3-fs error on sda3 (my root partition). The VMs run on the same RAID 5 logical drive, but in an LVM partition.

I could not log in via console or SSH. After a reboot the server seems to be running fine.

I found the following in TTY_000000.log:

06/21/08 11:50:06: *** PREFETCH ABORT exception: sf=a00fff18
06/21/08 11:50:06: CPSR=80000013: N=1,Z=0,C=0,V=0,Q=0 I=0 F=0 T=0 MODE=13(SVC)
06/21/08 11:50:06: r0=00000001 r1=0000004a r2=0000004b r3=fffffffd
06/21/08 11:50:06: r4=0000ffff r5=40040000 r6=00000000 r7=00000000
06/21/08 11:50:06: r8=a0e20320 r9=a0d48360 r10=000001bc r11=a00fff90
06/21/08 11:50:06: r12=00008d84 lr=a0b81de0 pc=a087de88
EVT#01762-06/21/08 11:50:06: 15=Fatal firmware error: Line 1014 in ../../raid/verdeMain.c

[0]: fp=a00ffea0, lr=a0898bbc - abort_prefetch+114
[1]: fp=a00ffef0, lr=a000abac - dbits+1788198
[2]: fp=a00fff14, lr=a000a74c - dbits+1787d38
[3]: fp=a00fff5c, lr=a087de88 - set_state+380
[4]: fp=a00fff90, lr=a087d9c8 - raid_task+318
[5]: fp=a00fffb8, lr=a08978b0 - main+3b0
[6]: fp=a00fffe4, lr=a0895fe0 - c_start+30
[7]: fp=a00ffffc, lr=9e8804cc - _start+6c
[8]: fp=a0018350, lr=a0006204 - dbits+17837f0
[9]: fp=a00183fc, lr=4a8 - 000004a8
MonTask: line 1014 in file ../../raid/verdeMain.c
INTCTL=16c00000:1003dcf, IINTSRC=0:0, FINTSRC=0:2000, CPSR=600000d3, sp=a00ffbe4
MegaMon> .....

Running the following:
Perc Firmware version 5.2.1-0067
Driver Version:

I also noticed in the log that on every boot I get the following regarding the storage environment:

T0: LSI Logic MegaRAID firmware loaded
T0: Firmware version 1.03.40-0316 built on Sep 14 2007 at 03:16:20
T0: Board is type 1028/0015/1028/1f03

T0: Initializing 1MB memory pool
T0: EVT#01763-T0: 0=Firmware initialization started (PCI ID 0015/1028/1f03/1028)
T0: EVT#01764-T0: 1=Firmware version 1.03.40-0316
T0: Authenticating RAID key: Done!
T0: EepromInit: Family=33, SN=ed462f020000
T0: Waiting for Expansion ROM to load (ESC to bypass)...mfiIsr: idr=00000008
T48: done.....
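When scripting checks against these controller logs, the firmware version can be pulled out of a line like the one above. A small sed sketch (the sample line is copied verbatim from the log; the pattern is just one way to do it):

```shell
#!/bin/sh
# Extract the controller firmware version from a boot-log line;
# the sample line is taken from the log above
line="T0: Firmware version 1.03.40-0316 built on Sep 14 2007 at 03:16:20"
echo "$line" | sed -n 's/.*Firmware version \([0-9.-]*\).*/\1/p'
# prints 1.03.40-0316
```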

This looks to me like a storage-driver problem, but I don't know what else to look at. As far as I can tell all the drivers are up to date.

Update: I ended up scrapping this server for other performance reasons. The error above was probably a driver issue, and I didn't want to deal with Dell to get a fix. I do remember getting an email months later from someone on the Dell Linux mailing list, and I believe there was a fix for it.

Thursday, June 5, 2008

Blank Email (D107, 8F07 errors) after migration to VMware
Strange issue.

We cloned a NetWare 6.5 SP5 Small Business Server to the VMware Server platform. Everything is running fine except...

We moved the GroupWise 7.0.2 post office and domain to a workstation with the agents turned off. Then we shut down the physical server, powered on the virtual clone (made earlier and prepared for migration by removing unneeded services), and copied the GroupWise files to it with a simple xcopy, again with the agents turned off.

After the 5/31 migration, users found that some, but not all, emails from between 5/17 and 5/31 were blank. Subject, to/from, size, etc. all show, but there is no message text and no attachments; trying to access the properties of a blank email gives a D107 error on the client and a CF07 error on the POA.

After changing the wphost.db file on the workstation copy of the GroupWise system to allow direct access, the 7.0.2 client shows the email fine there. I did a diff of the two GroupWise locations and found some blobs in offiles of the copy that are missing from the live system. I copied the missing blobs over to the live system, but it made no difference.

I then copied the workstation GroupWise system to a restore location and set it up on the POA; a user can selectively restore an email that was blank by going to File | Open Backup (after first deleting the blank email in the live system).
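The offiles comparison I did can be sketched with diff. The snippet below builds two tiny throwaway directories to stand in for the backup and live trees (a real run would point at the two grpwise locations; the blob filenames are made up):

```shell
#!/bin/sh
# List blob files present in the backup offiles tree but missing from
# the live one. Two temp dirs stand in for the real GroupWise trees.
backup=$(mktemp -d); live=$(mktemp -d)
echo blob > "$backup/3191abcd.000"        # present in both trees
echo blob > "$live/3191abcd.000"
echo blob > "$backup/3192ef01.000"        # missing from the live tree
diff -rq "$backup" "$live" | grep "^Only in $backup"
```

The `diff -rq` form reports only which files differ or are missing, which is all that matters for spotting absent blobs.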

I thought that was the end of it... but in the last four days we've seen new blank messages. Only two so far (it's a 22-user system), and they exhibit the same symptoms.

Is this blob corruption? I did compare file sizes between the two systems and found no discrepancies.

I'm concerned that it still appears to be happening...and I'm not sure where to look next.

Saving Groupwise from Netware to USB HD

Backing up GroupWise is an issue. It needs to be backed up daily, and it's 60 GB in size. Over the network that takes about 4.75 hours, and it's the same whether it's sent to a virtual machine on the same host over 1000BaseT or to a Windows 2000 server with the NetWare client installed over 100BaseT.
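A quick sanity check on those numbers: 60 GB in 4.75 hours works out to only a few MB/s, well below even 100BaseT's ceiling, which would explain why the copy time is the same on both links:

```shell
# Effective throughput of the backup described above: 60 GB in ~4.75 hours
awk 'BEGIN { mb = 60 * 1024; secs = 4.75 * 3600; printf "%.1f MB/s\n", mb / secs }'
# prints 3.6 MB/s - under 100BaseT's ~12 MB/s maximum, so the
# bottleneck is probably the server or disks rather than the wire
```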

The copy succeeds when run during the day, but when I set it up as a scheduled task it didn't complete. I noticed the server utilization in Monitor.nlm was stuck around 99%, and the hard drive light on the VMware server console was solid green.

Yesterday, after a suspend, LVM snapshot, and resume, VMware reported an internal consistency error on one of the vmdk files. The VM eventually loaded, but when I did an emergency backup using DBCopy the same thing as above happened. The problem is I don't know whether it got into that state because of DBCopy or because of the corruption. I then performed an xcopy with the GroupWise agents closed; five hours later I had a good backup (at least no copy failures were returned). After a restart the server again came up with the green drive light and high utilization...

I did a dbcopy last night and it stopped after two hours; the server was again at high utilization with the green hard-drive light. I've read reports that dbcopy can get stuck on some files and a reboot of the server is required, so I updated my NetWare client from 4.9 SP1 to SP4 in hopes that fixes it. The USB hard drive that I copy to was also in a funky state: I couldn't copy some files to it. After the reboot the drive had CHK files on it... 4 GB worth.

Here's the batch file I'm trying to use. I also reduced the number of threads to 1:

cd \batch

DATE /T >>gwfullback.log
TIME /T >>gwfullback.log

ECHO START Full copy >>gwfullback.log

net use g: \\tpm-fs1\vol1 gwusertpm /user:gwuser

rmdir /S /Q e:\grpwise\backC

move e:\grpwise\backB e:\grpwise\backC
move e:\grpwise\backA e:\grpwise\backB

dbcopy /v /t-1 g:\grpwise\TPM_DOM e:\grpwise\backA\TPM_DOM
dbcopy /v /t-1 g:\grpwise\TPM_POST e:\grpwise\backA\TPM_POST

net use g: /delete

ECHO DONE Full copy >>gwfullback.log
DATE /T >>gwfullback.log
TIME /T >>gwfullback.log

With the low thread count it takes longer, about 7 hours, but it did complete during the day.
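For reference, the three-generation rotation the batch file performs (drop backC, shift backB to backC and backA to backB) looks like this in POSIX shell; the base path here is a throwaway temp directory standing in for the real e:\grpwise tree:

```shell
#!/bin/sh
# Three-generation backup rotation, as in the batch file above.
# The base directory is illustrative (the batch uses e:\grpwise).
base=${BASE:-$(mktemp -d)}
rm -rf "$base/backC"                      # drop the oldest generation
[ -d "$base/backB" ] && mv "$base/backB" "$base/backC"
[ -d "$base/backA" ] && mv "$base/backA" "$base/backB"
mkdir -p "$base/backA"                    # fresh target for tonight's dbcopy
```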

Delay start of VMs

Using the following:

Set this in the /etc/vmware/config file:
autoStart.defaultStartDelay= "180"  # start delay of 3 minutes
autoStart.defaultStopDelay= "180"   # stop delay of 3 minutes
Set the start and stop order of the virtual machines in the individual VMs vmx file.

autostart= "poweron"
autostart.order= "10"  # set VM#1 to 10, VM#2 to 20, etc.
autostop.order= "10"   # set VM#1 to 10, VM#2 to 20, etc.

I haven't tested this yet, so I'm not sure whether it means VMware will start a VM, wait 3 minutes, and then start the next one, but that's my assumption.
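If my assumption is right, the behavior would look like the sketch below. The loop and VM names are illustrative only; on VMware Server the actual start would go through its command-line tool rather than an echo:

```shell
#!/bin/sh
# Illustrative only: how I read the autostart settings above - VMs start
# in autostart.order, with defaultStartDelay seconds between each.
# A real script would invoke VMware Server's CLI to start each .vmx.
delay=180
for vmx in vm1.vmx vm2.vmx; do
  echo "starting $vmx (then waiting ${delay}s)"
  # sleep "$delay"   # uncomment for the real delay
done
```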