Fun with stripe sets


I had two 100 gig western digital drives striped and used it as my boot
drive. It was lots of fun when I upgraded my motherboard and had to hack like mad
to update the raid driver so it could boot. But I got to have even more fun lately.

A few weeks ago the stripe set just totally screwed me. I did eventually fix it
but it took a lot of time. Somebody might find the story entertaining so here it is.

The hardware involved is two WD1000BB 100 gig drives striped on an ASUS A7V333
with integrated Promise FastTrak133 controller.

I was working away updating my mp3car webpage and my machine rebooted, losing
my unsaved changes.

Then it did it again a few minutes later, only this time it hung loading win2k. I
tried everything to make it boot, safe mode, recovery console, old hardware profiles,
etc, always hangs.

I tried booting to \winnt2, a clean copy of win2k I had used before. Hangs.

Tried to reinstall a second operating system to \winnt2, hangs.

I had a win2k on a separate 80 gig bootable drive. This has to work! Hangs.

Everything just hangs the computer at some point.

So I disconnected the raid drives. Win2k boots on the 80 gig. Reconnect it, hangs.

At this point I am getting frustrated. Promise controller says the stripe set is ok
and both drives are detected.

I have a bunch of data recovery tools but they run under nt\2k\xp and I can't boot
any of those with the stripe set connected.

I'm thinking about low level hard drive programs but I have 200 gigs of data I don't
want to trash, not ready to risk it yet.

So I take the stripe set out of the machine and put it into a different system that
has win98 on it. Stripe set detects fine but of course 98 doesn't read NT.

I fire up partition magic. It sees the 200 gig drive but there's nothing I can do
with it here except error check, which (same as every time I use it) reports some
useless error code and can't fix it.

I whip out my trusty sysinternals.com utilities: ntfsdos, ntfs for win98, etc.

Ran the chkntfs that comes with ntfsdos. Yay! It chkdsk's the drive. A few security
descriptor issues, nothing serious, but it says there is not enough free space to
repair it, then it hangs. Grrr.

I ran every other version of chkntfs that I could find. Most seem to do a complete
chkdsk and report few, if any, problems. But they all hang at the end after they say
the chkdsk is complete.

So I install ntfs for win98 and rem out the chkntfs from autoexec (since it always
hangs). No go, 98 hangs on boot up loading the ntfs driver. I dug up the ancient
ntfsdos that sysinternals originally wrote by reverse engineering ntfs. This one
let me see the files on my drive. It was essentially useless for data recovery but
it was nice to see a directory listing at least. (The latest ntfsdos editions just
wrap the actual ms ntfs dll, so it has the same hang effect as booting 2k. I'm sure
there's a lesson in that. :)

I'm starting to think this is no longer fun.

Now I get out Runtime's DiskExplorer NTFS. This is basically a disk editor that
understands and navigates the ntfs file system.

I quickly found what's been hanging my systems. NTFS keeps a second, partial copy of
the MFT (master file table). The main MFT looked fine, which is why chkdsk runs ok. The
second copy is normally stored in the middle of the drive. Whenever I went to see it,
it hung my machine.

I tried stepping sector by sector towards the second MFT, where it hangs, then I tried
stepping backwards, towards the end of the MFT. I found a two cluster area where the
machine hangs. The first cluster is the MFT copy. The second cluster was the
beginning of the ntfs log file. I question the sensibility of locating the log file
immediately after the second MFT, but that is another matter.
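
For anyone who would rather script that kind of hunt, here is a minimal Python sketch
of stepping through a range of sectors and noting which ones won't read. It assumes
the drive (or an image of it) can be opened like an ordinary file, and that sectors
are 512 bytes. The function name is made up for illustration; this is not what
DiskExplorer does internally. Also note that in my case a bad sector hung the whole
machine rather than returning an error, so a script like this is no guarantee.

```python
SECTOR_SIZE = 512  # assumption: 512-byte sectors


def find_unreadable_sectors(path, start_sector, end_sector):
    """Read sectors [start_sector, end_sector) one at a time.

    Returns the sector numbers that raised an I/O error or came back short.
    """
    bad = []
    # buffering=0 so each read really is one sector-sized request
    with open(path, "rb", buffering=0) as disk:
        for sector in range(start_sector, end_sector):
            try:
                disk.seek(sector * SECTOR_SIZE)
                data = disk.read(SECTOR_SIZE)
            except OSError:
                bad.append(sector)
                continue
            if len(data) != SECTOR_SIZE:
                bad.append(sector)
    return bad
```

Narrowing the range from both ends, the way I did by hand, is just a matter of
calling it with smaller and smaller windows around the trouble spot.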

So now it all made sense why chkdsk and nt/2k/xp hang. Chkdsk is certainly going to
want to update the log file at the end of the chkdsk and/or mirror MFT changes to the
backup.

Now that I know there's a bad block on the hard drive, I'm still too scared to run a
low level disk repair utility. They are always documented badly and always say to
back up the data before using them, and of course I can't back up squat.

Luckily diskexplorer has a file/directory/subdir save feature. I would have been up
the creek otherwise. I want to recover the files back onto an ntfs drive but of
course I can't do this in 98. I borrowed an 80 gig maxtor which I put in my main
machine and shared the drive.

Then on the win98 machine I used diskexplorer to save files/directories to the
network share. I saved my winnt folder and a few other important directories.

In the ntfs directories, any directory or file with a long file name is essentially
listed twice, once with the short name and again with the long name. Diskexplorer
(2.0) does not account for this, so subdirectories and files with long names get
copied twice. And it is recursive, so if that subdir contains a subdir with a long
name, it will be copied four times, and since the files in it have long names, they
get copied eight times, and so on. To keep the copies from taking forever I had to
run a batch file which periodically set the readonly attribute on files on the target
drive. Diskexplorer would silently skip copying target files that were readonly.
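
To put numbers on the doubling, and to show roughly what the readonly trick amounts
to, here is a small Python sketch. Both function names are hypothetical, and the
second function is a stand-in for my batch file, not a copy of it. It assumes the
doubling applies at every level of the path that has a long name.

```python
import os
import stat


def copies_at_depth(long_name_levels):
    """How many times an item gets copied when it and its parent folders
    account for `long_name_levels` long-named path components.

    Each long-named component is listed twice, so the count doubles per level.
    """
    return 2 ** long_name_levels


def mark_tree_read_only(target_dir):
    """Set the read-only attribute on every file under target_dir so a
    later copy pass skips them. Returns the number of files marked."""
    count = 0
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            os.chmod(os.path.join(root, name), stat.S_IREAD)
            count += 1
    return count
```

A long-named file three levels down (long-named folder, long-named subfolder, then
the file) comes out as copies_at_depth(3), which is the eight copies mentioned above.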

I would guess the recovery process took 2 to 3 times longer than it should have
because of this nonsense. But at least I was getting my data back.

So now that I had my new 80 gig drive with all the important stuff on it, it of
course refused to boot. Booting to recovery console on my other drive, I tried the
obvious fixboot / fixmbr stuff. No go.

So I started a win2k install to the \winnt2 folder, knowing this would fix the mbr
and stuff. After it was done copying files and rebooted, I got my win2k boot menu
back (you have to be quick because it defaults to booting the new win2k setup stuff).

Win2k booted up without incident and I was pleased to see everything back to normal.

Later on I bought a 160 gig maxtor (with all the 48 bit ATA fun that comes with it)
and finished copying the rest of my files off the stripe set.

I had all my data back, windows was working, and life was good, but the story doesn't
end here. I still had to repair my stripe drives and return the 80 gig I had
borrowed.

I wasted some more time just because I wanted to dick around and test my theory about
the bad block in the MFT copy. Ran disk explorer and edited the MFT entry for the
second MFT and the log file. I changed them to point to an area of the drive that was
in use for a big movie file or something. Then I copied the MFT data onto that area
as the backup MFT. For the logfile, I just zero'd out a whole cluster and hoped
that it would do. I have no idea what a logfile sector is supposed to look like.

The boot record also had to be updated to point to the second MFT. There is also a
copy of the boot record on the last sector of the drive that I had to change as well.
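
For reference, the boot record fields involved can be read and patched with a few
lines of Python. This is a sketch, not the tool I used (I did it by hand in the disk
editor). The offsets are the standard NTFS boot sector layout as I understand it:
bytes per sector at 0x0B, sectors per cluster at 0x0D, total sectors at 0x28, the MFT
cluster number at 0x30 and the MFT copy's cluster number at 0x38. The function names
are made up for illustration.

```python
import struct


def read_ntfs_boot_fields(boot_sector):
    """Pull the MFT-related fields out of a 512-byte NTFS boot sector."""
    if len(boot_sector) < 512 or boot_sector[3:11] != b"NTFS    ":
        raise ValueError("not an NTFS boot sector")
    bytes_per_sector, sectors_per_cluster = struct.unpack_from(
        "<HB", boot_sector, 0x0B)
    total_sectors, mft_lcn, mftmirr_lcn = struct.unpack_from(
        "<QQQ", boot_sector, 0x28)
    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "total_sectors": total_sectors,
        "mft_lcn": mft_lcn,          # cluster number of the main MFT
        "mftmirr_lcn": mftmirr_lcn,  # cluster number of the MFT copy
    }


def set_mftmirr_lcn(boot_sector, new_lcn):
    """Return a copy of the boot sector pointing at a relocated MFT copy."""
    patched = bytearray(boot_sector)
    struct.pack_into("<Q", patched, 0x38, new_lcn)
    return bytes(patched)
```

Whatever gets changed in the first sector has to be changed in the backup boot record
at the end of the drive too, or the two will disagree.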

Success! I booted into XP and the stripe set was working fine. Minor victory since it
still has the bad block and the data is already copied off. I ran a file compare
against the stripe set and the data I already copied. There were some minor
inconsistencies, some odd files that were missing on the copied data. But no binary
differences or serious omissions.

Now that I had checked my recovered data it was time to get low level on its ass and
fix the bad sector. I found a dos utility on the western digital site that does a
complete non-destructive read of a drive (about 40 minutes). Of course the first
drive I scanned didn't have anything wrong with it. I thought maybe a critical
overheat on it had caused the trouble. Guess not. Ran the program on the other drive
and sure enough it stammered a bit reading sectors near the middle. After the
full scan it gave me an 0256 error code which (after stumbling across the list on
their forums) means a bad block was found. Then it gave me the option of relocating
the bad block to a reserve block and it did that real quick. Of course it doesn't
say where the bad block was. Quality software, that.

Damn, had I known it was that easy I could have been spared all the earlier work! Next
time... After that I found a fancy new windows 98/nt/2k/xp utility that almost does
the same thing. Except it doesn't do a full read test. It just reads a random
number of sectors. How stupid is that? Idiots. No reformat option either.
Nice job there guys! I think its main purpose is to send WD all your drive
information rather than help people fix real problems.

So now the drives are fixed and I have my copies. I'm scrapping the stupid stripe
set. It was never as fast as it was supposed to be anyway.

I cloned (partition magic) the borrowed 80 gig drive back onto both of the 100's
separately. One of the 100's had the partition table for the 200 gig stripe.
Win2k wouldn't boot with it in there (surprise). You can't fdisk it since fdisk
won't touch ntfs partitions. So back to windows 98 and partition magic to clean
the partition table. When it's all over and working I will reformat the other 100
and dump it in a server or something.

I put the 100 back in the main machine and it was booting up and I was happy and then
it got to where the logon screen is supposed to appear and said that awgina.dll could
not be loaded and presented me with a 'reboot' button. Kinda odd since it was a
complete clone of the working 80 gig.

So after a moment or two swearing about that effing useless pcanywhere shit, I got
onto dejanews and found some good articles on the problem. It had nothing to do with
pcanywhere after all.

Turns out that when it loads msgina.dll (or awgina.dll), this is the moment where it
starts using oldskool dos device C: drive letter syntax. My system was insisting on
mounting C: by the 80 gigger's volume ID. With or without that 80 gig drive in my
system, C: was still reserved for it and win2k won't work. So once again I'm booting
back to the 80 gig with the 100 as a secondary drive.

Digging into the registry there is a key that lists the volume IDs of all the
hard drives windows has ever seen.

HKEY_LOCAL_MACHINE\SYSTEM\MountedDevices

It also maps the dos drive letters to the volume IDs. 
So I open up the registry on the 80 gig and find the volume id of the 100 gig
drive (easy to do, since it's mapped to D:). Exported the registry key and then
edited the file in a text editor. Stripped out everything except the 100 gigger's
entry for volume id and the d:\ entry. Then changed the d: to c:, and modified the
registry path to \zyzzy\.

Then I open up regedt32 and attach the system registry of the 100 gigger to \zyzzy\.
Import the .reg file and voila, my volume ID and C: designation are all fixed. Shut
down and reboot to the 100 gig drive. Windows boots up just fine. Now I have to
service pack 3 the thing and enable the 48 bit ata in order to use the new 160 gig
maxtor. That works without incident and the entire ordeal is over.
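
The text-editor surgery on the exported .reg file could be scripted along these
lines. This is a sketch under several assumptions: the export has already been
decoded to text, each value sits on a single line, the volume ID entry carries the
same data as the drive letter entry it belongs to, and the other drive's system hive
was loaded under HKEY_LOCAL_MACHINE with the name zyzzy. The function name is
hypothetical.

```python
def retarget_mounted_devices(reg_text, old_letter="D", new_letter="C",
                             hive_name="zyzzy"):
    """Keep only the entries that share data with old_letter's mapping,
    rename that mapping to new_letter, and point the key at the loaded hive."""
    old_name = r'"\\DosDevices\\{}:"'.format(old_letter)
    new_name = r'"\\DosDevices\\{}:"'.format(new_letter)

    lines = reg_text.splitlines()

    # Find the data currently mapped to the old drive letter.
    wanted = None
    for line in lines:
        if line.startswith(old_name + "="):
            wanted = line.split("=", 1)[1]
            break
    if wanted is None:
        raise ValueError("no entry for drive %s:" % old_letter)

    # Keep the original header line, then write the key under the loaded hive.
    out = [lines[0], "",
           "[HKEY_LOCAL_MACHINE\\%s\\MountedDevices]" % hive_name]
    for line in lines:
        if "=" not in line:
            continue
        name, data = line.split("=", 1)
        if data != wanted:
            continue  # belongs to some other drive
        out.append((new_name if name == old_name else name) + "=" + data)
    return "\n".join(out) + "\n"
```

Doing it by hand in a text editor works just as well for a one-off. Either way, look
over the resulting file before importing it.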

I strongly suspect it's the promise controller or driver that caused the system
hang when the bad sector was encountered. That REALLY pisses me off.

It's also no help that SMART does not work through the promise controller.
Brilliant! Isn't SMART one of the major things you want in a raid system?
(Note that in this case, SMART never indicated any problems.)
The promise bios takes freakin' forever to detect drives every boot-up,
can't they take .01 seconds to do a SMART check? I have to assume they are just
complete assholes plain and simple. At least the latest ASUS bios finally lets you
bypass the promise bios if you aren't using it.

Email me and tell me what a jerk I am or whatever you like.

Created 03 September 2002


Copyright © 2002 By Sean McLaughlin All Rights Reserved.

Email: Seanster@Seanster.com

www.Seanster.com/raid/raid_stripe_fun.html
