fork failed cannot allocate memory

Wed May 5 15:02:16 UTC 2004

Lucas Tepper wrote:

> Hello Ben,
>
> I saw your question below on the redhat-list:
> ----
>
>    * /Subject/: fork failed cannot allocate memory
>    * /Date/: Mon Feb 23 14:42:00 2004
>
> ------------------------------------------------------------------------
> I have a Dell 2650, dual Xeon box, 2GB RAM, 6GB swap with PERC 
> Hardware RAID card. It was running RedHat AS2.1 with the 2.4.9-e.27smp 
> kernel.
> Other than the kernel version, the box was fully up2date with patches.
> ----
>
> We are experiencing the same problem on an HP ProLiant DL580 server 
> with dual Xeon 2.8GHz processors, 10GB of memory and 6GB of swap. We 
> are using the 2.4.9-e.27enterprise kernel. About every 1.5 weeks the 
> system 'hangs' and we cannot log in anymore. In the logs I see 
> messages such as 'cannot fork: cannot allocate memory'. I made some 
> scripts to monitor the status, but they don't show any problems: there 
> is about 4GB of memory free at the time the system 'hangs'. The strange 
> thing is that we have a couple of other DL580s running with the same 
> configuration, but they don't have this problem (yet). I was 
> wondering, did your problems go away when you upgraded the kernel to 
> 2.4.9-e38?
>
> Regards,
> Lucas

We did upgrade the kernel and the problems have gone away.  However, we 
do not know whether it was the kernel or another change that we made at 
the same time.

We had discovered that every crash happened at exactly the time a large 
SCP process (rsync -e ssh) was running over a gigabit ethernet link.
When we changed the time of the scheduled rsync job, the "fork/crash" 
time moved with it.
We have multiple server systems in a cluster. Although they synchronize 
data all the time, the client machines hit the primary server unless it 
is down.
We moved the rsync job to the backup server in the cluster at the same 
time we upgraded the kernel.
The crash problem hasn't recurred since, but it might have recurred if 
we had left the rsync job on the primary.
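
If you want to check whether your hangs line up with a scheduled job in 
the same way, a timestamped memory log makes the comparison easier. The 
following is only a rough sketch, not the script we ran: the function 
names are made up for illustration, and watching LowFree/HighFree (which 
only appear on highmem kernels) is my assumption about what might be 
worth looking at, not something we confirmed.

```python
import time


def parse_meminfo(text):
    """Parse 'Key:   value kB' lines (as in /proc/meminfo) into {key: kB}.

    Lines without a 'kB' unit, such as the byte-count summary rows that
    2.4 kernels print at the top, are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def format_sample(fields, now,
                  keys=("MemFree", "SwapFree", "LowFree", "HighFree")):
    """Build one log line: UTC timestamp plus whichever keys are present."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now))
    present = ["%s=%dkB" % (k, fields[k]) for k in keys if k in fields]
    return stamp + " " + " ".join(present)
```

Run from cron every minute with the contents of /proc/meminfo and 
time.time(), appending the result to a file, and you can compare the 
last samples before a hang against the start time of the rsync job.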

Sorry,
-Ben.