RHEL4 Sun Java Messaging Server deadlock (was: redhat-list Digest, Vol 84, Issue 5)

(Imed Chihi) عماد الشيحي imed.chihi at gmail.com
Wed Feb 9 13:09:06 UTC 2011


> From: (Imed Chihi) عماد الشيحي  <imed.chihi at gmail.com>
> Subject: Re: RHEL4 Sun Java Messaging Server deadlock (was:

Hmm... I have looked back at my last message and I think I assumed too
much; some parts could sound a bit cryptic.  Here are a few comments
which should shed some light on the reasoning behind my "theories".

> Based on the above, I could suggest two theories to explain what's happening:
>
> 1. you have a Normal zone starvation
> Try setting vm.lower_zone_protection to something large enough, like 100 MB:
> sysctl -w vm.lower_zone_protection=100
> If this theory is correct, then the setting should fix the issue.

On 32-bit platforms, and for historical reasons, physical memory is
divided into three "zones": regions of physical memory which need to
be managed in different ways.

In your setup (32-bit with hugemem kernel), the Normal zone is 4GB in
size.  The Normal zone is a bit special in the sense that some kernel
allocations can only take place in this zone: typically buffers
allocated for disk and network IO.
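To see how the zones look on a given machine, you could peek at /proc
(a sketch: the LowTotal/LowFree lines exist only on 32-bit highmem
kernels, while /proc/buddyinfo lists free pages per zone on any 2.6
kernel):

```shell
# Low/High memory split; these lines only appear on 32-bit highmem
# kernels, so tolerate an empty match elsewhere.
grep -iE '^(Low|High)(Total|Free):' /proc/meminfo || true

# Free pages per zone, broken down by allocation order; one row
# per zone (DMA, Normal, HighMem on 32-bit).
cat /proc/buddyinfo
```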

What could happen under stress is running out of memory in this Normal
zone alone.  The result would show as a system slowing to a crawl,
near deadlock, despite having plenty of free memory overall.  Free
memory in the "wrong" zone simply does not help.

As the Normal zone can also take allocations from regular processes, we
could instruct the memory allocators to avoid filling this Normal zone
with allocations that can be serviced elsewhere (in the HighMem zone).
In your case, you seem to have exhausted the Normal zone (0.5% free),
hence the suggested parameter.
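A guarded way to apply the suggestion (a sketch assuming the RHEL4-era
2.6.9 kernel; later kernels dropped this tunable in favour of
vm.lowmem_reserve_ratio, so the script checks first):

```shell
#!/bin/sh
# Apply the suggested lower-zone protection, but only if the tunable
# exists on this kernel (it disappeared after the 2.6.9 era).
if [ -e /proc/sys/vm/lower_zone_protection ]; then
    # Needs root; the value is taken to be in megabytes here.
    sysctl -w vm.lower_zone_protection=100
else
    echo "vm.lower_zone_protection not present on this kernel"
fi
```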

> 2. you have a pagecache flushing storm
> A huge size of dirty pages from the IO of large data sets would stall
> the system while being sync'ed to disk.  This typically occurs once
> the pagecache size has grown to significant sizes.  Mounting the
> filesystem in sync mode (mount -o remount,sync /dev/device) would "fix"
> the issue.  Synchronous IO is painfully slow, but the test would at
> least tell where the problem is.  If this turns out to be the
> problem, then we could think of other less annoying options for a
> bearable fix.

The Linux virtual memory manager caches filesystem IO as long as there
is free memory.  This cache lives in the pagecache, a set of memory
pages dynamically sized to accommodate demand.
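For a rough idea of how large the pagecache has grown: the "Cached"
figure in /proc/meminfo is mostly pagecache, and "Buffers" covers
block-device data.

```shell
# Pagecache ("Cached") and block-device buffers ("Buffers"), in kB.
grep -E '^(Cached|Buffers):' /proc/meminfo
```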

When there is "memory pressure", that is, a situation where memory in
at least one zone becomes seriously scarce, the VM tries to free pages
aggressively; this typically translates into behaviours like direct
reclaim (freeing memory while a request to allocate memory is
waiting), scanning all the pages in a zone repeatedly, etc.  This
"aggressiveness" could result in saturating the IO subsystem for quite
some time while trying to flush a very large pagecache to disk in the
hope of freeing memory.  With a system like yours this pagecache could
be something like 10 or 20 GB, and writing that much data to disk
would stall the system for quite a long time.
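To watch how much dirty data is waiting to be flushed, and, as one of
the "less annoying options", to make writeback start earlier in
smaller bursts (the sysctl values below are illustrative, not tuned
recommendations):

```shell
# Dirty: pages modified but not yet written back;
# Writeback: pages currently being written to disk (both in kB).
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Start background writeback earlier and cap dirty memory lower, so
# flushes come as smaller, steadier bursts (needs root; illustrative
# values only, hence left commented out):
# sysctl -w vm.dirty_background_ratio=2
# sysctl -w vm.dirty_ratio=10
```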

The suggestion removes the biggest source of pagecache growth by
forcing synchronous IO, which avoids storms of very large writes to
the disk.

I hope this is less confusing.

 -Imed

-- 
Imed Chihi - عماد الشيحي
http://perso.hexabyte.tn/ichihi/



