Whitepaper: Red Hat's New Journaling File System: ext3

January 1, 2010

Michael K. Johnson

In Red Hat Linux 7.2, Red Hat provides its first officially supported journaling file system: ext3. The ext3 file system is a set of incremental enhancements to the robust ext2 file system that provide several advantages. This paper summarizes some of those advantages (first in general terms and then more specifically), explains what Red Hat has done to test the ext3 file system, and (for advanced users only) touches on tuning.


Tuning suggestions

The ext3 file system behaves somewhat differently from the ext2 file system, and the differences can show up in various ways. Advanced users may choose to tune the file system and I/O system for performance. This section introduces some of the more common tuning options that advanced users may wish to try. All tuning, of course, needs to be done in the context of performance testing of specific applications; there is no "one size fits all" approach to tuning. The information here is, however, intended to be generally useful.

Choosing elevator settings

Most Linux block device drivers use a generic tunable "elevator" algorithm for scheduling block I/O. The /sbin/elvtune program can be used to trade off between throughput and latency. Given similar loads, the ext3 file system may need smaller latency numbers passed to the /sbin/elvtune program than the ext2 file system does in order to give similar results.

In some cases, attempting to tune for maximum throughput at the expense of latency (that is, passing large read latency (-r) and write latency (-w) numbers to the /sbin/elvtune program) can actually decrease throughput while increasing latency. This effect is more pronounced with the ext3 file system, for reasons that include the following:

  • With the ext2 file system, writes are scheduled every 30 seconds; with the ext3 file system, writes are scheduled every 5 seconds. This keeps journal transactions from having a noticeable impact on system throughput and also keeps data on disk more up-to-date.

  • The ext3 file system, by journaling all metadata changes, can magnify the effect of atime changes significantly. You can mount a file system with the noatime option in order to disable atime updates. While atime is not the only source of metadata updates, on many systems, particularly heavily used servers on which many files are accessed, atime updates can be responsible for the majority of metadata updates. On these systems, turning off atime updates may noticeably reduce latency and increase throughput.
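
As a minimal sketch of the noatime suggestion above, the helper below adds noatime to the options field of /etc/fstab entries read from standard input. The function name add_noatime and the sample device and mount point are hypothetical, chosen only for illustration; the remount command in the final comment needs root privileges and is shown for orientation only.

```shell
#!/bin/sh
# add_noatime: read fstab-style lines on stdin and print them with
# "noatime" added to the fourth (options) field if it is not already there.
# Comment lines and blank lines are passed through unchanged.
add_noatime() {
    awk '
        /^[ \t]*(#|$)/ { print; next }
        NF >= 4 && $4 !~ /(^|,)noatime(,|$)/ { $4 = $4 ",noatime" }
        { print }
    '
}

# Example entry (hypothetical device and mount point):
echo "/dev/sda5 /home ext3 defaults 1 2" | add_noatime

# To apply the option to an already mounted file system without a reboot
# (as root):  mount -o remount,noatime /home
```

Note that awk rewrites a modified line with single spaces between fields, which /etc/fstab accepts.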

In order to tune for our default file system choice of ext3, Red Hat has halved the default read and write latency numbers (from 8192 read, 16384 write to 4096 read, 8192 write). We expect that in general use you will not have to change these numbers; our changed default values have produced good results in our tests. However, in order to tune for specific applications, we suggest benchmarking your applications with a variety of values, and testing interactive response during some runs if interactive response is important to you. In general, we recommend that you set read latency (-r) to half of write latency (-w).

For example, you might run:

    /sbin/elvtune -r 1024 -w 2048 /dev/sdd

to change the elevator settings for the device /dev/sdd (including all the partitions on /dev/sdd). Changes to the elevator settings for a partition will apply to the elevator for the device the partition is on; all partitions on a device share the same elevator.
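
When benchmarking with a variety of values, it can help to generate the candidate settings rather than type them by hand. The following sketch prints one elvtune command per candidate, keeping read latency at half of write latency as recommended above. The function name elvtune_candidates and the particular list of write latencies are assumptions made for illustration; the sketch only prints commands and does not run them.

```shell
#!/bin/sh
# elvtune_candidates: print one elvtune command per candidate setting for
# the given device, keeping read latency at half of write latency.
elvtune_candidates() {
    dev=$1
    for w in 16384 8192 4096 2048; do
        r=$((w / 2))
        echo "/sbin/elvtune -r $r -w $w $dev"
    done
}

# Print the candidates for the example device. In a real sweep you would
# run each command as root, then run your own benchmark and record the result.
elvtune_candidates /dev/sdd
```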

Once you have found elvtune settings that give you your most satisfactory mix of latency and throughput for your application set, you can add the calls to the /sbin/elvtune program to the end of your /etc/rc.d/rc.local script so that they are set again to your chosen values at every boot.
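
The lines added to rc.local might look something like the following sketch. The helper name apply_elevator, the ELVTUNE override variable, and the latency values and device are all illustrative assumptions; substitute the settings your own benchmarking selected. The block-device check is there so that a missing disk at boot produces a message instead of a failed elvtune call.

```shell
#!/bin/sh
# Sketch of lines to append to /etc/rc.d/rc.local.
# ELVTUNE can be overridden (for example with "echo") to preview the calls.
ELVTUNE=${ELVTUNE:-/sbin/elvtune}

# apply_elevator READ WRITE DEVICE...
# Apply the same latency pair to each device, skipping any that is absent.
apply_elevator() {
    r=$1; w=$2; shift 2
    for dev in "$@"; do
        if [ -b "$dev" ]; then
            "$ELVTUNE" -r "$r" -w "$w" "$dev"
        else
            echo "skipping $dev: not a block device" >&2
        fi
    done
}

# Example values and device only; substitute the settings you benchmarked.
apply_elevator 1024 2048 /dev/sdd
```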

Choosing journaling mode

Some characteristic loads show a very significant speed improvement with the data=writeback option, which provides weaker data consistency guarantees. With this option, the data consistency guarantees are essentially the same as those of the ext2 file system; the difference is that file system integrity is maintained continuously during normal operation (this is the journaling mode used by most other journaling file systems). One such load involves heavy synchronous writes. Others involve creating and deleting large numbers of small files, such as delivering a very large flow of small email messages. If you switch from ext2 to ext3 and find that your application performance drops substantially, the data=writeback option is likely to give you a significant amount of that performance back; you will still have some of the availability benefits of ext3 (the file system is always consistent) even though you do not have the more expensive data consistency guarantees.
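
The following sketch shows what selecting a journaling mode might look like in practice. It assumes that the mode is chosen at mount time, so the file system is unmounted and mounted again rather than remounted in place, and that the file system is not the root file system. The function name data_mode_commands, the device, and the mount point are hypothetical; the function only prints the commands so that they can be reviewed before being run as root.

```shell
#!/bin/sh
# data_mode_commands: print the commands that would switch a non-root ext3
# file system to the given journaling mode. Nothing is executed.
data_mode_commands() {
    dev=$1; mnt=$2; mode=$3
    case $mode in
        journal|ordered|writeback) ;;
        *) echo "unknown data mode: $mode" >&2; return 1 ;;
    esac
    echo "umount $mnt"
    echo "mount -t ext3 -o data=$mode $dev $mnt"
}

# Hypothetical device and mount point for a mail spool:
data_mode_commands /dev/sdd1 /var/spool/mail writeback

# The matching /etc/fstab entry would carry the same option, for example:
#   /dev/sdd1  /var/spool/mail  ext3  defaults,data=writeback  1 2
```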

Red Hat is continuing to work on several performance enhancements to ext3, so you can expect several of these cases to improve in the future. This means that if you choose data=writeback now, you may want to retest the default data=ordered with future releases to see what changes have been made relative to your workload.

Data integrity
In most cases, users write data by extending off the end of a file. Only in a few cases (such as databases) do users ever write into the middle of an existing file. Even overwriting an existing file is done by first truncating the file and then extending it again.

If the system crashes during such an extend in data=ordered mode, the data blocks may have been partially written, but the metadata change that extends the file will not have been committed, so the incompletely written data blocks will not be part of any file.

The only way to get mis-ordered data blocks in data=ordered mode after a crash is if a program was overwriting in the middle of an existing file at the time of the crash. In such a case there is no absolute guarantee about write ordering unless the program uses fsync() or O_SYNC to force writes in a particular order.
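
The fsync() and O_SYNC interfaces are C-level calls, but the same idea can be sketched from the shell. The example below assumes GNU dd, whose conv=fsync option calls fsync() on the output file before dd exits and whose conv=notrunc option leaves the rest of the file in place. The function name overwrite_synced is hypothetical. Because each call returns only after its data has been flushed, issuing several calls in sequence forces the writes to reach the disk in that order.

```shell
#!/bin/sh
# Overwrite bytes in the middle of an existing file and force them to disk.
# Assumes GNU dd: conv=notrunc keeps the rest of the file, conv=fsync makes
# dd call fsync() on the output file before it exits.
overwrite_synced() {   # usage: overwrite_synced FILE BYTE_OFFSET NEW_DATA
    printf '%s' "$3" | dd of="$1" bs=1 seek="$2" conv=notrunc,fsync 2>/dev/null
}

f=$(mktemp)
printf 'AAAAAAAAAA' > "$f"        # 10-byte file
overwrite_synced "$f" 3 'xyz'     # replace bytes 3..5 in place
cat "$f"; echo                    # prints AAAxyzAAAA
rm -f "$f"
```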