Tuning suggestions
The ext3 file system acts a bit differently than the ext2 file system,
and the differences can appear in various ways. Advanced users may
choose to tune the file system and I/O system for performance. This
is an introduction to some of the more common tuning that advanced
users may wish to try. All tuning, of course, needs to be done in
the context of performance testing of specific applications; there
is no "one size fits all" approach to tuning. This is, however,
intended to provide some generally useful information.
Choosing elevator settings
Most Linux block device drivers use a generic, tunable "elevator" algorithm for
scheduling block I/O. The /sbin/elvtune program can
be used to trade off between throughput and latency. Given similar loads, the
ext3 file system may need smaller latency numbers passed to the
/sbin/elvtune program in order to give results similar
to those of the ext2 file system.
In some cases, attempting to tune for maximum throughput at the expense of
latency, that is, using large read latency (-r) and write
latency (-w) numbers with the
/sbin/elvtune program, can actually decrease
throughput while increasing latency. This effect is more pronounced with the
ext3 file system, for reasons that include the following:
- With the ext2 file system, writes are scheduled every 30 seconds;
with the ext3 file system, writes are scheduled every 5 seconds.
This keeps journal transactions from having a noticeable impact on
system throughput and also keeps data on disk more up-to-date.
- The ext3 file system, by journaling all metadata changes, can
magnify the effect of atime changes significantly. You can mount a file
system with the noatime flag in order to disable atime
updates. While atime is not the only source of metadata updates, on many
systems, particularly busy servers that access large numbers of files,
atime updates can be responsible for the majority of metadata updates. On
these systems, turning off atime updates may noticeably reduce latency
and increase throughput.
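As a minimal sketch of the noatime suggestion above, the helper below only prints the remount command so that it can be reviewed before use. The function name and the /home mount point used in the usage note are illustrative assumptions, not part of the original text.

```shell
# Print (do not run) the command that remounts a file system with noatime.
# $1 is the mount point; the caller decides whether to execute the result.
noatime_remount_cmd() {
    mount_point=$1
    echo "mount -o remount,noatime $mount_point"
}
```

Calling noatime_remount_cmd /home prints the command; running that command as root applies it to the live system. To keep the setting across reboots, noatime can instead be added to the options field for that file system in /etc/fstab.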
In order to tune for our default file system choice of ext3, Red Hat has reduced
the default read and write latency numbers to half of their previous values (from
8192 read, 16384 write to 4096 read, 8192 write). We expect that in general
use, you will not have to change these numbers; we hope we have already done
this for you. Our changed default values have produced good results in our
tests. However, in order to tune for specific applications, we suggest
benchmarking your applications with a variety of values, testing interactive
response during some runs if interactive response is important to you. In
general, we recommend that you set read latency (-r) to half of
write latency (-w).
For example, you might run:
/sbin/elvtune -r 1024 -w 2048 /dev/sdd
to change the elevator settings for the device /dev/sdd
(including all the partitions on /dev/sdd). Changes to the
elevator settings for a partition will apply to the elevator for the device the
partition is on; all partitions on a device share the same elevator.
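The benchmarking advice above can be sketched as a small helper that generates candidate settings to try. This is only an illustration: the function names are hypothetical, the candidate values are arbitrary, and the helper applies the "read latency is half of write latency" guideline rather than any measured optimum. It prints the commands instead of running them.

```shell
# Print the elvtune command for one device and one write latency,
# setting read latency to half of write latency as the text recommends.
elvtune_cmd() {
    dev=$1
    write_lat=$2
    read_lat=$((write_lat / 2))
    echo "/sbin/elvtune -r $read_lat -w $write_lat $dev"
}

# Print one elvtune command per candidate write latency.
# $1 is the device; the remaining arguments are write latency values.
elvtune_candidates() {
    dev=$1
    shift
    for w in "$@"; do
        elvtune_cmd "$dev" "$w"
    done
}
```

For instance, elvtune_candidates /dev/sdd 2048 4096 8192 lists three settings; each can be applied as root and followed by a run of your own benchmark before moving on to the next.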
Once you have found elvtune settings that give you your most satisfactory mix of
latency and throughput for your application set, you can add the calls to the
/sbin/elvtune program to the end of your
/etc/rc.d/rc.local script so that they are set again
to your chosen values at every boot.
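One way to make the settings persistent, sketched under the assumption that your boot script is a plain shell script such as /etc/rc.d/rc.local, is to append the call only if the identical line is not already present. The function name is hypothetical.

```shell
# Append an elvtune call to an rc.local-style script unless the exact
# line is already there. $1 is the script path; the remaining arguments
# are passed to /sbin/elvtune verbatim.
add_elvtune_to_rc() {
    rc_file=$1
    shift
    line="/sbin/elvtune $*"
    if ! grep -qxF -- "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

As root, add_elvtune_to_rc /etc/rc.d/rc.local -r 1024 -w 2048 /dev/sdd would record the example settings from above; running it a second time leaves the file unchanged.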
Choosing journaling mode
Speed
Some characteristic loads show a very significant speed improvement
with the data=writeback option, which provides weaker data
consistency guarantees. In those cases, the data consistency guarantees are
essentially the same as those of the ext2 file system; the difference is that
file system integrity is maintained continuously during normal operation (this is
the journaling mode used by most other journaling file systems). One of these
cases involves heavy synchronous writes. Others involve creating and deleting
large numbers of small files, such as delivering a very large flow of
small email messages. If you switch from ext2 to ext3 and find that your
application performance drops substantially, the data=writeback
option is likely to give you a significant amount of performance back; you will
still have some of the availability benefits of ext3 (the file system is always
consistent) even if you do not have the more expensive data consistency
guarantees.
Red Hat is continuing to work on several performance enhancements to ext3, so
you can expect several of these cases to improve in the future. This means that
if you choose data=writeback now, you may want to retest the
default data=ordered with future releases to see what changes
have been made relative to your workload.
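To compare journaling modes on a test file system, a sketch like the following can build the mount command for each mode. The function name, the device /dev/sdd1, and the mount point /mnt/test are assumptions for illustration; data=journal is a third ext3 mode that this section does not discuss. The helper prints the command rather than running it.

```shell
# Print (do not run) a mount command for an ext3 file system with a chosen
# data journaling mode. Rejects anything other than the three ext3 modes.
ext3_mount_cmd() {
    mode=$1
    dev=$2
    dir=$3
    case $mode in
        journal|ordered|writeback)
            echo "mount -t ext3 -o data=$mode $dev $dir"
            ;;
        *)
            echo "unknown data mode: $mode" >&2
            return 1
            ;;
    esac
}
```

For example, ext3_mount_cmd writeback /dev/sdd1 /mnt/test prints the command for a writeback mount. The data mode is typically chosen when the file system is mounted, so plan to unmount and mount again between benchmark runs rather than relying on a remount.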
Data integrity
In most cases, users write data by appending to the end of a file.
Only in a few cases (such as databases) do users ever write into the
middle of an existing file. Even overwriting an existing file is usually
done by first truncating the file and then extending it again.
If the system crashes during such an extend in data=ordered
mode, then the data blocks may have been partially written, but the metadata
change that extends the file will not have been committed, so the
incompletely-written data blocks will not be part of any file.
The only way to get mis-ordered data blocks in data=ordered
mode after a crash is if a program was overwriting in the middle of an existing
file at the time of the crash. In such a case there is no absolute guarantee
about write ordering unless the program uses fsync() or
O_SYNC to force writes in a particular order.
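From a shell script, one way to get the fsync() behavior described above when overwriting in the middle of a file is GNU dd. The sketch below assumes GNU coreutils (conv=fsync is a GNU extension), and the function name is hypothetical.

```shell
# Overwrite bytes in the middle of an existing file and flush them to disk
# before returning. Assumes GNU dd.
# $1 = target file, $2 = file holding the new bytes, $3 = byte offset.
overwrite_synced() {
    # bs=1 makes seek= a byte offset; notrunc keeps the rest of the file;
    # fsync makes dd call fsync() on the output before it exits.
    dd if="$2" of="$1" bs=1 seek="$3" conv=notrunc,fsync 2>/dev/null
}
```

A program that needs several overwrites to reach the disk in a particular order can issue them one at a time this way, since each call returns only after its data has been flushed. GNU dd also accepts oflag=sync, which opens the output with O_SYNC instead. Using bs=1 keeps the sketch simple but is slow for large writes.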