[rhn-users] RH ES 4.2 x86_64 MD RAID-1 file systems lockups

Jed Donnelley jed at nersc.gov
Wed Feb 1 21:23:41 UTC 2006


Red Hat ES 4.2 x86_64 MD RAID-1 users,

I've recently run into a simple situation where the file system on some of
my Red Hat ES 4.2 x86_64 systems:

[jed at sbuild7 ~]$ uname -a
Linux sbuild7.nersc.gov 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:02 
EDT 2005 x86_64 x86_64 x86_64 GNU/Linux
[jed at sbuild7 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux ES release 4 (Nahant Update 2)
[jed at sbuild7 ~]$  free
              total       used       free     shared    buffers     cached
Mem:       4041364    4016180      25184          0     150096    3405148
-/+ buffers/cache:     460936    3580428
Swap:      8388600        176    8388424
[jed at sbuild7 ~]$

hangs up during a somewhat long file operation.

If I execute the following in an MD RAID-1 file system:

$ dd if=/dev/zero of=testfile bs=1024 count=8388608 &

(this creates an 8GB file of zeros), wait a minute or so while the write
gets going, and then run much of anything, e.g.

$ du -ks *

$ top

$ ps

$ ls
...

All my commands hang.  The system becomes unresponsive to remote
ssh connections, etc.  It pretty much freezes up for the duration of
that dd.  I can't break out of the dd (e.g. if I run it synchronously) and
can't kill it (though interestingly, in some cases I can delete the file
out from under it).
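
If anyone wants to try reproducing this, the whole sequence boils down
to the following sketch (/raid1 below is just a placeholder for whatever
directory sits on your MD RAID-1 file system):

cd /raid1                                             # any directory on the MD RAID-1 file system
dd if=/dev/zero of=testfile bs=1024 count=8388608 &   # 8GB of zeros, written in the background
sleep 60                                              # give the write a minute to get going
du -ks *                                              # on my systems this (and ls, ps, top, ...) hangs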

I've now repeated this test numerous times on three systems, all of which
use MD RAID-1 file systems.  If I try the same exercise on a RAID-0 file system
it doesn't hang.  I've also tried the same test on a Red Hat ES 4.2
32-bit system and don't run into this problem there.  The two test systems
I used had 4GB and 8GB of real memory and 8GB and 4GB
of swap (respectively).
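
For reference, when comparing the RAID-1 and RAID-0 cases I just check
the array layout with the standard tools (the /dev/md0 name below is only
an example; substitute whichever md device backs the file system):

cat /proc/mdstat               # shows all md arrays and their RAID levels
mdadm --detail /dev/md0        # per-array detail, including the underlying disks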

I did some testing with smaller file sizes (1GB, 2GB, and 4GB) and didn't
run into this problem.  When I create an 8GB file as above, it hangs
consistently.
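
In case it helps anyone repeat the size comparison, it can be scripted
along these lines (a sketch only; run it in a directory on the RAID-1
file system, and expect the 8GB pass to hang the machine as described
above):

for size in 1 2 4 8; do
    count=$((size * 1024 * 1024))   # number of 1024-byte blocks per GB
    echo "=== ${size}GB test ==="
    dd if=/dev/zero of=testfile bs=1024 count=$count &
    sleep 60                        # let the write get going
    time du -ks .                   # only the 8GB case hangs here on my systems
    wait
    rm -f testfile
done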

When the dd finally completes (many minutes - perhaps 10?) the system
comes unhung and appears to be back to normal.  Note that I certainly don't
recommend running this test on a production system.

Of course I would expect some resource contention with such a large file
operation going on.  However, I wouldn't expect the system to hang up
as it does.
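
This is only a guess on my part, but if someone wants to see whether it's
write-back data piling up, it may be worth starting a monitor in another
terminal before the dd (nothing responds once the hang starts):

vmstat 5                                   # watch the bo (blocks written out) and wa (I/O wait) columns
grep -E 'Dirty|Writeback' /proc/meminfo    # dirty data waiting to be flushed to the array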

I'd be interested to hear the experience of others who might have
an opportunity to perform this test (RH ES 4.x x86_64 with MD RAID-1:
execute the command  dd if=/dev/zero of=testfile bs=1024 count=8388608 &
and after a minute or two run some commands that touch
the same file system) and see what happens.  At this point I'm
not sure whether I should move ahead with these systems or
back off on some aspect of their configuration (e.g. x86_64).

--Jed http://www.nersc.gov/~jed/



