
Re: External Journal scenario - good idea?

Jeremy Rumpf wrote:

On Wednesday 30 October 2002 07:44 am, Vinnie wrote:

Currently, the array is partitioned with a /boot partition, and a /
partition, each as ext3 with the default data=ordered journaling mode.
I have begun to realize gradually why it is a decent idea to break up
the filesystem into separate mount points and partitions, and may yet
end up doing that.  But that's a rabbit to hunt another day, unless
taking care of this is also required to solve this problem.

This is _very_ advisable.

Yep now (I think) I understand. Since I have one large / filesystem, all writes go through the same "funnel". All writes have to use the same journal, going to the same "drive" (array). Since the same drives are involved writing to the shared dirs for SMB clients, as those which are involved with reads/writes to NFS mailbox dirs and other stuff, NFS requests and MySQL requests have to "get in line" with SMB requests when it's busy.

But if these other requests (NFS mailboxes, MySQL, etc.) are on separate spindles, drives which are not part of the RAID5 array, they are in a different line waiting to be processed. This makes sense.

This file server performs 5 key fileserver-related roles, due to its
having the large RAID5 file storage for the network:

1. Serves the mailboxes for our domain to the two frontend mail/web
servers via NFS mount

2. Runs the master SQL server - the two mail/web servers run local slave
copies of the mail account databases

3. Stores the master copy of web documents served by the web servers
(and will replicate them to web servers when documents change, still
working on this though)

4. Samba file server for storage needs on the network

5. Limited/restricted-access FTP server for web clients

Do any of these require more than 120GB of storage (meaning are they too large to fit on a single 120GB RAID1 set)?

Currently our complete usage of the single RAID5 array is right around 100GB. It is mostly file storage/backups from other hosts on the network. This will no doubt represent the largest file storage requirements of all the fileserver functions for this machine.

In light of the smaller amount of space really needed for all of the other functions (combined), and the fact that each 120GB drive we pull off the RAID5 array costs us around 100GB of RAID5 storage capacity (and the drives would have to come out of the array in PAIRS, one pair for each RAID1 array we were to create in this external 8-bay unit), it seems that the best use of the external RAID enclosure and the 120GB drives in it would be to create the other arrays elsewhere and keep the large array for file storage. That is, if I am to keep a RAID5 array going at all. I'm going to have to think about this some and decide if I can settle for something else, like a RAID0+1 array, or smaller RAID1 arrays.
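To put rough numbers on the trade-off above, here is a minimal Python sketch of the capacity arithmetic. It assumes the 7-drive RAID5 set mentioned later in this message and nominal 120GB drives; the function names are made up for illustration, and formatted capacity will come out somewhat lower than these nominal figures.

```python
def raid5_usable_gb(drives, size_gb):
    """Nominal usable capacity of a RAID5 set (one drive's worth goes to parity)."""
    if drives < 3:
        raise ValueError("RAID5 needs at least 3 drives")
    return (drives - 1) * size_gb


def raid1_usable_gb(pairs, size_gb):
    """Nominal usable capacity of several two-drive RAID1 mirrors."""
    return pairs * size_gb


def layout(total_drives, size_gb, raid1_pairs):
    """Split a pool of drives into RAID1 pairs plus one RAID5 set from the rest."""
    raid5_drives = total_drives - 2 * raid1_pairs
    return {
        "raid5_drives": raid5_drives,
        "raid5_gb": raid5_usable_gb(raid5_drives, size_gb),
        "raid1_gb": raid1_usable_gb(raid1_pairs, size_gb),
    }


if __name__ == "__main__":
    # Assumption: 7 drives of 120GB each currently in the RAID5 set.
    for pairs in range(0, 3):
        print(pairs, "RAID1 pair(s):", layout(7, 120, pairs))
```

Under these assumptions, each RAID1 pair carved out of the set trades 240GB of nominal RAID5 capacity for 120GB of mirrored space, which is the waste being weighed here.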

As you said, using a pair of 120GB drives for each RAID1 array used for other storage purposes (mailboxes, ftp, SQL database) would be a really big waste of space.

Also, I'm not so sure I would be gaining much advantage to make RAID1 arrays in the same external unit, assuming I still had a RAID5 array in the same unit. That is, if what I am seeing has much or anything to do with the parity calculation speed of the RAID controller in this external subsystem. If it is swamped with XOR calculations while writing to a 7 drive array, it would probably not be much less swamped calculating parity data for a 4-5 drive array, and even a separate RAID1 array working behind the same RAID controller may suffer write performance issues because the data has to be processed by the same RAID controller to actually get written to the RAID1 drives.

But I am really not even sure that what we're seeing here is a problem with the speed of the RAID controller. From some other reading I have done, it seems that grabbing up RAM to cache writes and combine it all into one big write is something that the 2.4 kernel series is rather notorious for. I saw an article/review of external RAID subsystems (both SCSI and ATA-to-SCSI type) which said the same thing - that Windows 2000 servers were a lot better at asynchronous I/O than kernel 2.4-based Linux, and proceeded to describe much of the same malady I have been seeing here. They did say that a lot of work is going into newer Linux kernels to make it better at async disk I/O.

I did try building a 2.4.19 kernel this past weekend, and it crashed MISERABLY during a large write test. Major SCSI driver error messages, and it hung the SCSI bus to the point that I had to not only hit the reset button on the server, but also cycle the power on the RAID unit, before I could successfully RE-boot. I saw in the Changelogs for 2.4.19 that the Adaptec 78xx drivers have been revamped a couple times since 2.4.18. I guess I'm just going to have to stay with 2.4.18 for a while.

I have performed the recommended bdflush sysctl tweak to try to make the kernel write dirty buffers more often, and while I am seeing a marked increase in SCSI bus activity, write performance doesn't seem to have improved a great deal. But from the "free" command (and this has always been the deal), it's not the "buffers" RAM usage that is so high when heavy disk write I/O is going on, it's the "cached" RAM usage that hits the roof.
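One way to watch the "buffers" versus "cached" split during a write test is to sample /proc/meminfo instead of eyeballing "free". The following is only a sketch: it parses the "Name: value kB" lines, the sample text is invented for illustration (it is not output from this server), and the helper names are hypothetical.

```python
def parse_meminfo(text):
    """Parse 'Name:   12345 kB' lines into a dict of name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        # Skip header or summary lines whose first token is not a number.
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cached_share(fields):
    """Fraction of total RAM currently accounted as page cache."""
    return fields["Cached"] / fields["MemTotal"]


# Invented sample, roughly shaped like a 2GB machine under heavy writes.
SAMPLE = """\
MemTotal:  2064000 kB
MemFree:     18000 kB
Buffers:     42000 kB
Cached:    1750000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("buffers kB:", info["Buffers"])
    print("cached  kB:", info["Cached"])
    print("cached share of RAM: %.2f" % cached_share(info))
```

On a live system the same function could be fed the contents of /proc/meminfo read at intervals during the copy, to see whether the tweak changes how high "Cached" climbs.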

I am going to split up the single large filesystem into multiple mounts as you suggested, since (thanks to your reply) this is now much more clearly a good idea. But I am concerned that even after doing this, since it is the same kernel with its same "cache it first, then write it all at once" semantics, I may not be in much better shape.

It's really a shame to suspect so strongly that I would get the most improved write performance out of this machine by dropping from 2GB of RAM to 256MB. ;) Operating on the concept that if it has nowhere to cache it, it HAS to write more often... ;)

Remember though, you can move the journal to an external device at any time. I would heavily recommend that you break up your spindles and allocate the journal with the filesystem (a large journal with the filesystem) to start out with. Then if performance still demands it, grab some small(er) disks and move the journals off to them.

When I say large journal, I usually think around the 250MB range. I personally wouldn't recommend allocating a super large journal (greater than 1GB), but I'll step aside and let the FS experts advise on that issue.

I was considering the massive journal size for the samba share mount on the idea that if the journal is big enough to be a "staging area" for file copy operations from clients that may total out around 2GB or more (possibly), we could keep the journal commit activity largely an asynchronous operation, rather than a chain of panic-mode synchronous operations because we are straddling that 25-50 percent full trigger until the data stops coming from the client machine. But I'm not 100% sure I understand how it all works just yet; I have to do some more reading. It could actually be counter-productive to have such a large journal.
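As a rough check on the "staging area" idea, the sketch below computes where the 25 and 50 percent marks quoted above would fall for a given journal size. It takes those percentages at face value as an assumption, and it treats the journal as if file data passed through it, which would apply to data=journal rather than the data=ordered mode this filesystem currently uses (ordered mode journals metadata only). The function name is hypothetical.

```python
def trigger_window_mb(journal_mb, low=0.25, high=0.50):
    """Return the (low, high) fill marks in MB for a journal of the given size.

    The 25%/50% defaults are the figures quoted in this thread, taken as an
    assumption here rather than as verified ext3 behaviour.
    """
    return journal_mb * low, journal_mb * high


if __name__ == "__main__":
    copy_mb = 2048  # the ~2GB client copy described above
    for journal_mb in (250, 1024, 2048):
        low, high = trigger_window_mb(journal_mb)
        fits = copy_mb <= low
        print("journal %5d MB: marks at %7.1f / %7.1f MB, 2GB copy stays below low mark: %s"
              % (journal_mb, low, high, fits))
```

By this arithmetic even a 2GB journal would reach its low mark a quarter of the way through a 2GB copy, so a journal would need to be several times the size of the copy to act as a pure staging area.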

One other snag it seems we may run into is the fact that the / partition
already has a journal (/.journal, I presume), since it's already an ext3
partition.  Is it possible to tell the system we want the journal
somewhere else instead?  Strikes me that when we're ready to move to the
external journal, we may have to mount the / partition ext2, then remove
the journal, and create the new one and point the / partition to it with
the e2fs tools?
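For reference, the sequence asked about above can be sketched as a dry-run planner that only prints the e2fsprogs commands and runs nothing. The device names are placeholders, not the devices on this server, and the example uses a data filesystem rather than /. It assumes the filesystem can be unmounted first and that the journal device is created with the same block size as the filesystem.

```python
import shlex


def external_journal_plan(fs_dev, journal_dev, block_size=4096):
    """Build the command list for moving an ext3 journal to an external device.

    Nothing is executed; the caller decides what to do with the plan.
    """
    return [
        # The journal can only be removed while the filesystem is
        # unmounted (or mounted read-only) and cleanly shut down.
        ["umount", fs_dev],
        ["tune2fs", "-O", "^has_journal", fs_dev],
        # Format the dedicated partition as a journal device, with a
        # block size matching the filesystem it will serve.
        ["mke2fs", "-O", "journal_dev", "-b", str(block_size), journal_dev],
        # Attach the external journal to the filesystem.
        ["tune2fs", "-J", "device=" + journal_dev, fs_dev],
    ]


if __name__ == "__main__":
    # Placeholder devices for illustration only.
    for cmd in external_journal_plan("/dev/sdb1", "/dev/sdc1"):
        print(shlex.join(cmd))
```

Keeping this as a printed plan rather than a script that runs the tools leaves room to check each step by hand, which seems prudent for anything that rewrites a filesystem's journal.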

Yes, except I would _not_ advise moving the / partition journal to an external device. The / partition should have very little activity (assuming /var or /var/log is a separate file system). This is the prime reason you should not be allocating one huge / filesystem. Break it up into something like:


So on these (above), have them at least on separate partitions. Possibly the same drive, but at least separate partitions? (which would give them separate journals). And on the ones below:

and create special mounts for your samba, mysql, webroot (NFS), mail (NFS), stuff.


Since this is where the majority of the real file activity is going on, put each of these on separate drives (or RAID1 arrays), so we have not only separate journals but separate spindles too?

Jeremy, thank you so much for your reply. This has really given me a lot to chew on. And looking at my watch I see that it's Friday again... meaning I can actually work on this for a few days... <grin>.

