
Re: [Linux-cluster] File system slow & crash

I just noticed that one node is running kernel version 2.6.18-164.15.1.el5. I'm not sure whether the difference has any effect.

On Thu, Apr 22, 2010 at 2:27 AM, Somsak Sriprayoonsakul <somsaks gmail com> wrote:

We are using GFS2 on a 3-node cluster running kernel 2.6.18-164.6.1.el5 (RHEL/CentOS 5, x86_64), with 8-12GB of memory in each node. The underlying storage is an HP 2312fc smart array equipped with 12 15K rpm SAS drives, configured as RAID10 using 10 HDDs + 2 spares. The array has about 4GB of cache. Connectivity is 4Gbps FC through an HP StorageWorks 8/8 Base e-port SAN Switch.

Our application is Apache 1.3.41, mostly serving static HTML files plus a few PHP scripts. Note that we had to downgrade to 1.3.41 due to an application requirement. Apache is configured with MaxClients set to 500. Each HTML file is placed in a different directory. The PHP scripts modify the HTML files and take a lock before each modification. We use round-robin DNS to balance load across the web servers.

The GFS2 file system was formatted with 4 journals and runs on an LVM volume. We configured CMAN, QDiskd, and fencing as appropriate, and everything worked fine. We used QDiskd because the cluster initially had only 2 nodes. We are using manual_fence temporarily, since no fencing hardware has been configured yet. GFS2 is mounted with the noatime,nodiratime options.

Initially, the application ran fine. The problem is that, over time, the load average on some nodes gradually climbs to about 300-500, whereas under normal workload it should be about 10. When the load piles up, most HTML modifications fail.

We suspected a plock_rate issue, so we modified cluster.conf and added some more mount options, such as num_glockd=16 and data="", to increase performance. After successfully rebooting the system and mounting the volume, we ran the ping_pong test (http://wiki.samba.org/index.php/Ping_pong) to see how fast locking performs. The lock rate increased greatly, from 100 to 3-5k/sec. However, after running ping_pong on all 3 nodes simultaneously, the ping_pong program hung in D state, and we could not kill the process even with SIGKILL.
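For readers unfamiliar with what a lock-rate figure like "3-5k/sec" measures, here is a minimal Python sketch that times fcntl byte-range lock/unlock cycles on a file. It is not the ping_pong program and does not reproduce its multi-node coordination. It is a single-process approximation, and the function name measure_lock_rate and its parameters are made up for illustration. On a cluster, the path would point at a file on the shared GFS2 mount.

```python
import fcntl
import os
import time


def measure_lock_rate(path, duration=1.0, num_bytes=3):
    """Return fcntl lock/unlock cycles per second on byte ranges of `path`.

    Hypothetical helper for illustration; not part of ping_pong.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Make sure the file is large enough to hold the locked bytes.
        os.ftruncate(fd, num_bytes)
        cycles = 0
        offset = 0
        deadline = time.monotonic() + duration
        while time.monotonic() < deadline:
            # Take an exclusive POSIX lock on one byte, then release it.
            fcntl.lockf(fd, fcntl.LOCK_EX, 1, offset, os.SEEK_SET)
            fcntl.lockf(fd, fcntl.LOCK_UN, 1, offset, os.SEEK_SET)
            # Walk through the bytes in turn, wrapping at the end.
            offset = (offset + 1) % num_bytes
            cycles += 1
        return cycles / duration
    finally:
        os.close(fd)
```

Running this from one node at a time gives a rough baseline. Lock contention between nodes, which is where the hang appeared, only shows up when several nodes lock the same file concurrently.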

Due to time constraints, we decided to leave the system as is, with ping_pong stuck on all nodes, while serving web requests. After running for hours, the httpd processes got stuck in D state and couldn't be killed. Web serving was no longer possible at all. We had to reset all machines (unmounting was not possible). The machines came back and the GFS volume was back to normal.
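As an aside, processes stuck in D (uninterruptible sleep) state can be listed by reading the state field of /proc/<pid>/stat on Linux. The sketch below is one way to do that in Python. The function names parse_stat and uninterruptible_processes are hypothetical, and it only reports which processes are blocked, not why.

```python
import os


def parse_stat(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    start = line.index("(")
    end = line.rindex(")")
    pid = int(line[:start].strip())
    comm = line[start + 1:end]
    state = line[end + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return a list of (pid, comm) for processes currently in D state."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan, or the line was malformed
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling uninterruptible_processes() on each node while the problem is occurring would show whether httpd and ping_pong are the only blocked processes.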

Since we had to reset all machines, I decided to run gfs2_fsck on the volume. I unmounted GFS2 on all nodes, ran gfs2_fsck, answered "y" to many questions about freeing blocks, and got the volume back. However, the stuck-process problem occurred again very quickly. More seriously, trying to kill a running process on GFS or to unmount the volume now yields a kernel panic and suspends the volume.

After this, the volume never returned to normal. It now crashes (kernel panic) almost immediately when we try to write anything to it. This happens even after I removed the extra mount options and left only noatime and nodiratime. I have not run gfs2_fsck again yet, since we decided to leave the volume as is and try to back up as much data as possible.

Sorry for such a long story. In summary, my question is

I have attached our cluster.conf as well as kernel panic log with this e-mail.

Thank you very much in advance

Best Regards,

Somsak Sriprayoonsakul

