Issue #1 November 2004
Understanding Virtual Memory
by Norm Murray and Neil Horman
- Introduction
- Definitions
- The Life of a Page
- Tuning the VM
- Example Scenarios
- Further Reading
- About the Authors
Introduction
One of the most important aspects of an operating system is the Virtual
Memory Management system. Virtual Memory (VM) allows an operating system
to perform many of its advanced functions, such as process isolation, file
caching, and swapping. As such, it is imperative that an administrator
understand the functions and tunable parameters of an operating system's
Virtual Memory Manager so that optimal performance for a given workload
may be achieved. After reading this article, the reader should have a
rudimentary understanding of the data the Red Hat Enterprise Linux (RHEL3)
VM controls and the algorithms it uses. Further, the reader should have a
fairly good understanding of general Linux VM tuning techniques. It is
important to note that Linux as an operating system has a proud legacy of
overhaul. Items which no longer serve useful purposes or which have better
implementations as technology advances are phased out. This implies that
the tuning parameters described in this article may be out of date if you
are using a newer or older kernel. Fear not however! With a well grounded
understanding of the general mechanics of a VM, it is fairly easy to
convert knowledge of VM tuning to another VM. The same general principles
apply, and documentation for a given kernel (including its specific
tunable parameters) can be found in the corresponding kernel source tree
under the file Documentation/sysctl/vm.txt.
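Before tuning anything, it helps to see which knobs the running kernel actually exposes. On most Linux systems the full list of VM tunables can be viewed with either of the following commands (the exact set varies by kernel version):
sysctl -a | grep "^vm\."
ls /proc/sys/vm/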
Definitions
To properly understand how a Virtual Memory Manager does its job, it helps to understand what components comprise a VM. While the low level view of a VM can be overwhelming for most, a high level view is necessary to understand how a VM works and how it can be optimized for workloads.
What Comprises a VM
The inner workings of the Linux virtual memory subsystem are quite complex, but they can be described at a high level in terms of the following components:
MMU
The Memory Management Unit (MMU) is the hardware base that makes a VM system possible. The MMU allows software to reference physical memory by aliased addresses, quite often more than one. It accomplishes this through the use of pages and page tables. The MMU uses a section of memory to translate virtual addresses into physical addresses via a series of table lookups.
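The unit the MMU translates is the page. As a quick sanity check, the page size on a given system can be queried from the shell; on most x86 systems this reports 4096 bytes:
getconf PAGESIZE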
Zoned Buddy Allocator
The Zoned Buddy Allocator is responsible for the management of page allocations to the entire system. This code manages lists of physically contiguous pages and maps them into the MMU page tables, so as to provide other kernel subsystems with valid physical address ranges when the kernel requests them (Physical to Virtual Address mapping is handled by a higher layer of the VM). The name Buddy Allocator is derived from the algorithm this subsystem uses to maintain its free page lists. All physical pages in RAM are cataloged by the Buddy Allocator and grouped into lists. Each list represents clusters of 2^n pages, where n is incremented in each list. If no entries exist on the requested list, an entry from the next list up is broken into two separate clusters; one is returned to the caller while the other is added to the next list down. When an allocation is returned to the buddy allocator, the reverse process happens. Note that the Buddy Allocator also manages memory zones, which define pools of memory that have different purposes. Currently there are three memory pools for which the Buddy Allocator manages accesses (an example of inspecting these free lists on a live system follows the zone list below):
- DMA — This zone consists of the first 16 MB of RAM, from which legacy devices allocate to perform direct memory operations.
- NORMAL — This zone encompasses memory addresses from 16 MB to 1 GB and is used by the kernel for internal data structures as well as other system and user space allocations.
- HIGHMEM — This zone includes all memory above 1 GB and is used exclusively for system allocations (file system buffers, user space allocations, etc.).
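On kernels that provide it (most 2.6 and later kernels, and some vendor 2.4 kernels), the Buddy Allocator's free lists can be inspected directly through /proc/buddyinfo, which prints one row per zone with one column per cluster order:
cat /proc/buddyinfo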
Slab Allocator
The Slab Allocator provides a more usable front end to the Buddy Allocator for those sections of the kernel which require memory in sizes that are more flexible than the standard 4 KB page. The Slab Allocator allows other kernel components to create caches of memory objects of a given size. The Slab Allocator is responsible for placing as many of the cache's objects on a page as possible and monitoring which objects are free and which are allocated. When allocations are requested and no more are available, the Slab Allocator requests more pages from the Buddy Allocator to satisfy the request. This allows kernel components to use memory in a much simpler way; components which make use of many small portions of memory are not each required to implement their own memory management code, and fewer pages are wasted. The Slab Allocator may only allocate from the DMA and NORMAL zones.
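The caches currently managed by the Slab Allocator, along with their object sizes and allocation counts, are exported through /proc/slabinfo. For example, to view the first few:
head -20 /proc/slabinfo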
Kernel Threads
The last components in the VM subsystem are the kernel threads: kscand, kswapd, kupdated, and bdflush. These tasks are responsible for the recovery and management of in-use memory. All pages of memory have an associated state (for more information on the memory state machine, refer to the section called “The Life of a Page”). In general, the active tasks in the kernel related to VM usage are responsible for attempting to move pages out of RAM. Periodically they examine RAM, trying to identify and free inactive memory so that it can be put to other uses in the system.
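These threads show up in the process list like any other task, so a quick way to confirm they are running (thread names vary slightly between kernel versions) is:
ps ax | egrep 'kswapd|kscand|kupdated|bdflush'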
The Life of a Page
All of the memory managed by the VM is labeled with a state. These states let the VM know what to do with a given page under various circumstances. Depending on the current needs of the system, the VM may transfer pages from one state to the next, according to the state machine in Figure 2, “VM Page State Machine”. Using these states, the VM can determine what is being done with a page by the system at a given time and what actions the VM may take on the page. The states that have particular meanings are as follows (an example of inspecting the corresponding counters follows the list):
- FREE — All pages available for allocation begin in this state. This indicates to the VM that the page is not being used for any purpose and is available for allocation.
- ACTIVE — Pages which have been allocated from the Buddy Allocator enter this state. It indicates to the VM that the page has been allocated and is actively in use by the kernel or a user process.
- INACTIVE DIRTY — This state indicates that the page has fallen into disuse by the entity which allocated it and thus is a candidate for removal from main memory. The kscand task periodically sweeps through all the pages in memory, taking note of the amount of time the page has been in memory since it was last accessed. If kscand finds that a page has been accessed since it last visited the page, it increments the page's age counter; otherwise, it decrements that counter. If kscand finds a page with its age counter at zero, it moves the page to the inactive dirty state. Pages in the inactive dirty state are kept in a list of pages to be laundered.
- INACTIVE LAUNDERED — This is an interim state in which those pages which have been selected for removal from main memory enter while their contents are being moved to disk. Only pages which were in the inactive dirty state can enter this state. When the disk I/O operation is complete, the page is moved to the inactive clean state, where it may be deallocated or overwritten for another purpose. If, during the disk operation, the page is accessed, the page is moved back into the active state.
- INACTIVE CLEAN — Pages in this state have been laundered. This means that the contents of the page are in sync with the backed-up data on disk. Thus, they may be deallocated by the VM or overwritten for other purposes.
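How many pages are currently in each of these states can be read from /proc/meminfo. The field names differ between kernel versions (older Red Hat 2.4 kernels report counters such as Inact_dirty and Inact_clean separately, while later kernels collapse them into Active and Inactive), but a command along these lines gives a snapshot of the state machine at work:
egrep -i 'active|inact' /proc/meminfo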
Tuning the VM
Now that the picture of the VM mechanism is sufficiently illustrated, how is it adjusted to fit certain workloads? There are two methods for changing tunable parameters in the Linux VM. The first is the sysctl interface. The sysctl interface is a programming oriented interface, which allows software programs to modify various tunable parameters directly. It is exported to system administrators via the sysctl utility, which allows an administrator to specify a value for any of the tunable VM parameters via the command line. For example:
sysctl -w vm.max_map_count=65535
The sysctl utility also supports the use of a configuration file
(/etc/sysctl.conf), in which all the desirable
changes to a VM can be recorded for a system and restored after a restart
of the operating system, making this access method suitable for long term
changes to a system VM. The file is straightforward in its layout, using
simple key-value pairs with comments for clarity. For example:
#Adjust the min and max read-ahead for files
vm.max-readahead=64
vm.min-readahead=32
#turn on memory over-commit
vm.overcommit_memory=2
#bump up the percentage of memory in use to activate bdflush
vm.bdflush="40 500 0 0 500 3000 60 20 0"
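Changes recorded in /etc/sysctl.conf are applied automatically at boot time. To apply them to the running system without a reboot, reload the file with:
sysctl -p /etc/sysctl.conf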
The second method of modifying VM tunable parameters is via the proc file
system. This method exports every group of VM tunables as a virtual file,
accessible via all the common Linux utilities used for modifying file
contents. The VM tunables are available in the directory
/proc/sys/vm/ and are most commonly read and modified
using the cat and echo commands. For
example, use the command cat /proc/sys/vm/kswapd to
view the current value of the kswapd tunable. The
output should be similar to:
512 32 8
Then, use the following command to modify the value of the tunable:
echo 511 31 7 > /proc/sys/vm/kswapd
Use the cat /proc/sys/vm/kswapd command again to verify
that the value was modified. The output should be:
511 31 7
The proc file system interface is a convenient method for making
adjustments to the VM while attempting to isolate the peak performance of
a system. For convenience, the following sections list the VM tunable
parameters as the filenames they are exported to in the
/proc/sys/vm/ directory. Unless otherwise noted,
these tunables apply to the RHEL3 2.4.21-4 kernel.
bdflush
The bdflush file contains 9 parameters, of which 6
are tunable. These parameters affect the rate at which pages in the
buffer cache (the subset of pagecache which stores files in memory) are
freed and returned to disk. By adjusting the various values in this
file, a system can be tuned to achieve better performance in
environments where large amounts of file I/O are performed. Table 1. “bdflush Parameters” defines the parameters for
bdflush in the order they appear in the
file.
| Parameter | Description |
|---|---|
| nfract | The percentage of dirty pages in the buffer cache required to activate the bdflush task |
| ndirty | The maximum number of dirty pages in the buffer cache to write to disk in each bdflush execution |
| reserved1 | Reserved for future use |
| reserved2 | Reserved for future use |
| interval | The number of jiffies (10 ms periods) to delay between bdflush iterations |
| age_buffer | The time for a normal buffer to age before it is considered for flushing back to disk |
| nfract_sync | The percentage of dirty pages in the buffer cache required to cause the tasks which are writing pages of memory to begin writing those pages to disk instead |
| nfract_stop_bdflush | The percentage of dirty pages in the buffer cache required to allow bdflush to return to an idle state |
| reserved3 | Reserved for future use |

Table 1. bdflush Parameters
Generally, systems that require more free memory for application allocation want to set the bdflush values higher (except for age_buffer, which would be moved lower), so that file data is sent to disk more frequently and in greater volume, thus freeing up pages of RAM for application use. This, of course, comes at the expense of CPU cycles because the system processor spends more time moving data to disk and less time running applications. Conversely, systems which are required to perform large amounts of I/O would want to do the opposite to these values, allowing more RAM to be used to cache disk files so that file access is faster.
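As an illustrative (not prescriptive) starting point, a system that should push file data to disk more aggressively might raise nfract and nfract_sync and lower age_buffer relative to the values shown in the earlier sysctl.conf example; the right numbers depend entirely on the workload:
sysctl -w vm.bdflush="60 500 0 0 500 1500 80 20 0"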
dcache_priority
This file controls the bias of the priority for caching directory contents. When the system is under stress, it selectively reduces the size of various file system caches in an effort to reclaim memory. By increasing this value, memory reclamation bias is shifted away from the dirent cache. By reducing this amount, the bias is shifted towards reclaiming dirent memory. This is not a particularly useful tuning parameter, but it can be helpful in maintaining the interactive response time on an otherwise heavily loaded system. If you experience intolerable delays in communicating with your system when it is busy performing other work, increasing this parameter may help.
hugetlb_pool
The hugetlb_pool file is responsible for recording
the number of megabytes used for huge pages. Huge pages are just like
regular pages in the VM, only they are an order of magnitude
larger. Note also that huge pages are not swappable. Huge pages are both
beneficial and detrimental to a system. They are helpful in that each
huge page takes only one set of entries in the VM page tables, which
allows for a higher degree of virtual address caching in the TLB
(Translation Look-aside Buffer: A device which caches virtual address
translations for faster lookups) and a requisite performance
improvement. On the downside, they are very large and can be wasteful of
memory resources for those applications which do not need large amounts
of memory. Some applications, however, do require large amounts of
memory and can make good use of huge pages if they are written to be
aware of them. If a system is running applications which require large
amounts of memory and are aware of this feature, then it is advantageous
to increase this value to an amount satisfactory to that application or
set of applications.
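For example, to reserve roughly 256 MB of RAM for huge pages and then confirm the reservation (the exact field names in /proc/meminfo vary by kernel), one could run:
sysctl -w vm.hugetlb_pool=256
grep -i huge /proc/meminfo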
inactive_clean_percent
This control specifies the minimum percentage of pages in each page zone that must be in the clean or laundered state. If any zone drops below this threshold, and the system is under pressure for more memory, then that zone will begin having its inactive dirty pages laundered. Note that this control is only available on the 2.4.21-5EL kernels forward. Raising the value for the corresponding zone which is memory starved causes pages to be paged out more quickly, eliminating memory starvation at the expense of CPU clock cycles. Lowering this number allows more data to remain in RAM, increasing the system performance but at the risk of memory starvation.
kswapd
While this set of parameters previously defined how frequently and in what volume a system moved non-buffer cache pages to disk, in Red Hat Enterprise Linux 3, these controls are unused.
max_map_count
The max_map_count file allows for the restriction
of the number of VMAs (Virtual Memory Areas) that a particular process
can own. A Virtual Memory Area is a contiguous area of virtual address
space. These areas are created during the life of the process when the
program attempts to memory map a file, links to a shared memory segment,
or allocates heap space. Tuning this value limits the number of VMAs that a process can own. Limiting the number of VMAs a process can own can lead to problematic application behavior, because the system returns out of memory errors when a process reaches its VMA limit, but doing so can free up lowmem for other kernel uses. If your system is running low on memory in the NORMAL zone, then lowering this value will help free up memory for kernel use.
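Each line in a process's /proc/<pid>/maps file corresponds to one VMA, so the current count for a process can be checked directly. For example, the following counts the VMAs of the process reading the file and then lowers the system-wide limit (the limit shown is illustrative):
wc -l /proc/self/maps
sysctl -w vm.max_map_count=32768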
max-readahead
The max-readahead tunable affects how early the Linux VFS (Virtual File System) fetches the next block of a file from disk. File readahead values are determined on a per-file basis in the VFS and are adjusted based on the behavior of the application accessing the file. Anytime the current position being read in a file plus the current readahead value results in the file pointer pointing to the next block in the file, that block is fetched from disk. By raising this value, the Linux kernel allows the readahead value to grow larger, resulting in more blocks being prefetched for applications that access files in a predictable, linear fashion. This can result in performance improvements but can also result in excess (and often unnecessary) memory usage. Lowering this value has the opposite effect. By forcing readaheads to be less aggressive, memory may be conserved at a potential performance impact.
min-readahead
Like max-readahead,
min-readahead places a floor on the readahead
value. Raising this number forces a file's readahead value to be
unconditionally higher, which can bring about performance improvements
provided that all file access in the system is predictably linear from
the start to the end of a file. This, of course, results in higher
memory usage from the pagecache. Conversely, lowering this value allows
the kernel to conserve pagecache memory at a potential performance
cost.
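Both readahead tunables can be adjusted together. For instance, a server streaming large files sequentially might raise the ceiling while keeping the floor modest; the values below are only a starting point:
sysctl -w vm.max-readahead=128
sysctl -w vm.min-readahead=3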
overcommit_memory
overcommit_memory is a value which sets the general kernel policy toward granting memory allocations. If the value is 0, then the kernel checks to determine if there is enough memory free to grant a memory request to a malloc call from an application. If there is enough memory, then the request is granted. Otherwise, it is denied and an error code is returned to the application. If the value is set to 1, the kernel allows all memory allocations, regardless of the current memory allocation state. Enabling this can be somewhat helpful in environments which allocate large amounts of memory expecting worst case scenarios but do not use it all. If the value is set to 2, then the kernel limits the total amount of allocatable memory to the size of swap plus a percentage of physical RAM, as defined by the overcommit_ratio value.
overcommit_ratio
The overcommit_ratio tunable defines the amount by which the kernel overextends its memory resources in the event that overcommit_memory is set to the value of 2. The value in this file represents the percentage of physical RAM which, added to the size of swap space, is considered available for allocation. For instance, if this value is set to 50, then the kernel would treat a system with 1 GB of RAM and 1 GB of swap as a system with 1.5 GB of allocatable memory when considering whether to grant a malloc request from an application. The general formula for this tunable is:
allocatable memory = (swap size + (RAM size * overcommit_ratio / 100))
Use these previous two parameters with caution. Enabling
overcommit_memory can create significant
performance gains at little cost but only if your applications are
suited to its use. If your applications use all of the memory they
allocate, memory overcommit can lead to short performance gains followed
by long latencies as your applications are swapped out to disk
frequently when they must compete for oversubscribed RAM. Also, ensure
that you have at least enough swap space to cover the overallocation of
RAM (meaning that your swap space should be at least big enough to
handle the percentage of overcommit in addition to the regular 50
percent of RAM that is normally recommended).
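Before enabling overcommit, it is worth checking how much memory the current workload actually commits. On most kernels /proc/meminfo exposes this through the Committed_AS field (and, on kernels with strict accounting, a CommitLimit field):
grep -i commit /proc/meminfo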
pagecache
The pagecache file adjusts the amount of RAM which
can be used by the page cache. The page cache holds various pieces of
data, such as open files from disk, memory mapped files, and pages of
executable programs. Modifying the values in this file dictates how much
of memory is used for this purpose. Table 2. “pagecache Parameters” defines the parameters for pagecache in
the order they appear in the file.
| Parameter | Description |
|---|---|
| min | The minimum amount of memory to reserve for pagecache use |
| borrow | The percentage of pagecache pages kswapd uses to balance the reclaiming of pagecache pages and process memory |
| max | If more memory than this percentage is used by pagecache, kswapd only evicts pages from the pagecache. Once the amount of memory in pagecache is below this threshold, kswapd begins moving process pages to swap again |

Table 2. pagecache Parameters

Increasing these values allows more programs and cached files to stay in memory longer, thereby allowing applications to execute more quickly. On memory starved systems, however, this may lead to application delays as processes must wait for memory to become available. Moving these values downward swaps processes and other disk-backed data out more quickly, allowing for other processes to obtain memory more easily and increasing execution speed. For most workloads the automatic tuning is sufficient. However, if your workload suffers from excessive swapping and a large cache, you may want to reduce the values until the swapping problem goes away.
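The three values are read and written as a single line, so the current settings can be inspected and, for a memory-constrained system, lowered via the proc interface (the replacement values here are illustrative only):
cat /proc/sys/vm/pagecache
echo 1 15 30 > /proc/sys/vm/pagecache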
page-cluster
The kernel attempts to read multiple pages from disk on a page fault to
avoid excessive seeks on the hard drive. This parameter defines the
number of pages the kernel tries to read from disk during each page
fault. The value is interpreted as
2^page-cluster pages for each page fault. A
page fault is encountered every time a virtual memory address is
accessed for which there is not yet a corresponding physical page
assigned or for which the corresponding physical page has been swapped
to disk. If the memory address has been requested in a valid way (for
example, the application contains the address in its virtual memory
map), then the kernel associates a page of RAM with the address or
retrieves the page from disk and places it back in RAM. Then the kernel
restarts the application from where it left off. By increasing the
page-cluster value, pages subsequent to the
requested page are also retrieved, meaning that if the workload of a
particular system accesses data in RAM in a linear fashion, increasing
this parameter can provide significant performance gains (much like the
file readahead parameters described earlier). Of course if your workload
accesses data discretely in many separate areas of memory, then this can
just as easily cause performance degradation.
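The default value on most kernels is 3, meaning 2^3 = 8 pages are read on each such fault; because the value is an exponent, doubling the pages read per fault means incrementing the value by one rather than doubling it:
cat /proc/sys/vm/page-cluster
echo 4 > /proc/sys/vm/page-cluster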
Example Scenarios
Now that we have covered the details of kernel tuning, let us look at some example workloads and the various tuning parameters that may improve system performance.
File (IMAP, Web, etc.) Server
This workload is geared towards performing a large amount of I/O to and
from the local disk, thus benefiting from an adjustment allowing more
files to be maintained in RAM. This speeds up I/O by caching more files
in RAM and eliminating the need to wait for disk I/O to complete. A
simple change to sysctl.conf as follows usually
benefits this workload:
#increase the amount of RAM pagecache is allowed to use
#before we start moving it back to disk
vm.pagecache="10 40 100"
General Compute Server With Many Active Users
This workload is a very general type of configuration. It involves many active users who likely run many processes, all of which may or may not be CPU intensive or I/O intensive or a combination thereof. As the default VM configuration attempts to find a balance between I/O and process memory usage, it may be best to leave most configuration settings alone in this case. However, this environment likely contains many small processes which, regardless of workload, consume memory resources, particularly lowmem. It may help, therefore, to tune the VM to conserve low memory resources when possible:
#lower the pagecache max to keep from eating all memory up with cache
vm.pagecache=10 25 50
#lower max-readahead to reduce the amount of unneeded IO
vm.max-readahead=16
Non interactive (Batch) Computing Server
A batch computing server is usually the exact opposite of a file server. Applications run without human interaction, and they commonly perform little I/O. The number of processes running on the system is controlled. Consequently, this system should be tuned to allow maximum throughput:
#Reduce the amount of pagecache normally allowed
vm.pagecache="1 10 100"
#do not worry about conserving lowmem, not that many processes
vm.max_map_count=128000
#crank up overcommit, processes can sleep as they are not interactive
vm.overcommit_memory=2
vm.overcommit_ratio=75
Further Reading
- Understanding the Linux Kernel by Daniel Bovet and Marco Cesati (O'Reilly & Associates)
- Virtual Memory Behavior in Red Hat Enterprise Linux AS 2.1 by Bob Matthews and Norm Murray
- Towards an O(1) VM by Rik van Riel
- The Linux Kernel Source Tree, versions 2.4.21-4EL and 2.4.21-5EL
About the Authors
Neil Horman is a software engineer at Red Hat. He lives in Raleigh, NC with his wife and 1 year old son. He has a BS and MS in computer engineering from North Carolina State University. When not enjoying family time he enjoys developing, repairing, and writing about software.
Norm Murray has been working at Red Hat for the last 3 years. Coming to programming after dissatisfaction with the state of genetic engineering, he is now an information and learning junkie.