[Linux-cluster] gfs2 and quotas - system crash

stephen.rankin at stfc.ac.uk stephen.rankin at stfc.ac.uk
Tue Mar 11 10:43:24 UTC 2014


No, we are not using NFS. Our setup is:

1. Two-node cluster with the two_node option enabled.
2. Hitachi SAN (RAID 6) connected to both nodes over 4 Gbit links.
3. One 10TB, two 4TB and one 2TB LUN presented to each node, each formatted as a separate gfs2 file system with user quotas enabled (a rough sketch of the quota setup follows this list). Only the two nodes in the cluster mount these file systems.
4. A user fills up their quota on the 10TB file system and the system crashes (which appears to be a consistent outcome).
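For reference, the quota side of the affected mount is set up roughly as below (a minimal sketch; the mount options and limits shown are illustrative of how we have it configured, and the username is a placeholder):

  # /etc/fstab entry with quota enforcement enabled on the gfs2 file system
  /dev/mapper/sanvg4-lvol0  /san4  gfs2  defaults,quota=on  0 0

  # or mounted by hand with enforcement on
  mount -t gfs2 -o quota=on /dev/mapper/sanvg4-lvol0 /san4

  # 10G soft/hard block limit for a user, set with the standard quota tools
  # (setquota takes block limits in 1K blocks, then inode limits)
  setquota -u someuser 10485760 10485760 0 0 /san4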

The quota was only 10G for that user, so they were not using a vast amount of space. In total about 5TB is currently in use on the 10TB file system (/san4):

Filesystem                           Size  Used Avail Use% Mounted on
/dev/mapper/vg_chadwick-LogVol00     9.8G  3.4G  6.0G  36% /
tmpfs                                253G   47M  253G   1% /dev/shm
/dev/mapper/mpathap3                1008M  148M  810M  16% /boot
/dev/mapper/vg_chadwick-LogVol06      11T  4.7T  5.8T  45% /home
/dev/mapper/vg_chadwick-LogVol05     9.8G  7.5G  1.8G  82% /opt
/dev/mapper/vg_chadwick-LogVol01     5.0G  140M  4.6G   3% /tmp
/dev/mapper/vg_chadwick-LogVol02     9.8G  8.3G  976M  90% /usr
/dev/mapper/vg_chadwick-LogVol03     5.0G  2.7G  2.1G  57% /var
/dev/mapper/sanvg1-sanlv1            4.0T  2.9T  1.2T  71% /san1
/dev/mapper/sanvg2-sanlv2            4.0T  3.2T  851G  80% /san2
/dev/mapper/sanvg3-sanlv3            2.0T  1.8T  259G  88% /san3
/dev/mapper/sanvg4-lvol0              10T  5.1T  5.0T  51% /san4

Filesystem                              Inodes   IUsed      IFree IUse% Mounted on
/dev/mapper/vg_chadwick-LogVol00        647168   54317     592851    9% /
tmpfs                                 66157732      58   66157674    1% /dev/shm
/dev/mapper/mpathap3                     65536      62      65474    1% /boot
/dev/mapper/vg_chadwick-LogVol06     749502464 1002734  748499730    1% /home
/dev/mapper/vg_chadwick-LogVol05        647168  236023     411145   37% /opt
/dev/mapper/vg_chadwick-LogVol01        327680     378     327302    1% /tmp
/dev/mapper/vg_chadwick-LogVol02        647168  318728     328440   50% /usr
/dev/mapper/vg_chadwick-LogVol03        327680    7228     320452    3% /var
/dev/mapper/sanvg1-sanlv1            320266537  140997  320125540    1% /san1
/dev/mapper/sanvg2-sanlv2            223028034   44074  222983960    1% /san2
/dev/mapper/sanvg3-sanlv3             67820453    8357   67812096    1% /san3
/dev/mapper/sanvg4-lvol0            1336002497  392526 1335609971    1% /san4
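
In case the numbers are useful, usage against the limits is checked with the standard quota tools, along these lines (the username below is a placeholder):

  # per-user usage and limits on the affected gfs2 mount
  quota -u someuser
  # usage summary for all users on /san4
  repquota /san4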

Thanks,

Stephen.

-----Original Message-----
From: Abhijith Das [mailto:adas at redhat.com] 
Sent: 10 March 2014 19:38
To: linux clustering
Subject: Re: [Linux-cluster] gfs2 and quotas - system crash



----- Original Message -----
> From: "stephen rankin" <stephen.rankin at stfc.ac.uk>
> To: linux-cluster at redhat.com
> Sent: Monday, March 10, 2014 1:15:08 PM
> Subject: [Linux-cluster] gfs2 and quotas - system crash
> 
> Hello,
> 
> 
> 
> When using gfs2 with quotas on a SAN that is providing storage to two 
> clustered systems running CentOS6.5, one of the systems can crash. 
> This crash appears to be caused when a user tries to add something to 
> a SAN disk when they have exceeded their quota on that disk. Sometimes 
> a stack trace is produced in /var/log/messages which appears to 
> indicate that it was gfs2 that caused the problem.
> At the same time you get the gfs2 stack trace you also see problems 
> with someone exceeding their quota.
> 
> The stack trace is below.
> 
> Has anyone got a solution to this, other than switching off quotas? I 
> have switched off quotas, which appears to have stabilised the system so 
> far, but I do need the quotas on.
> 
> Your help is appreciated.
> 

Hi Stephen,

We have another report of this bug where gfs2 was exported using NFS:
https://bugzilla.redhat.com/show_bug.cgi?id=1059808. Are you using NFS in your setup as well? We have not been able to reproduce it to figure out what might be going on. Do you have a set procedure with which you can recreate it reliably? If so, it would be of great help.
Also, more info about your setup (file sizes, number of files, how many nodes mounting gfs2, what kinds of operations are being run, etc.) would be helpful as well.
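Even a rough script along the lines of the sketch below would be a useful starting point, since the stack trace goes through gfs2_mkdir right after the quota-exceeded message (the username, mount point and sizes here are placeholders, not something we know about your setup):

  # as the quota-limited user: fill the 10G quota on the gfs2 mount,
  # then keep creating directories past the limit (the oops is in the mkdir path)
  su - someuser -c '
    dd if=/dev/zero of=/san4/fill bs=1M count=10240
    for i in $(seq 1 1000); do mkdir /san4/dir$i; done
  '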

Cheers!
--Abhi

> Stephen Rankin
> STFC, RAL, ISIS
> 
> Mar  5 11:40:50 chadwick kernel: GFS2: fsid=analysis:lvol0.1: quota exceeded for user 101355
> Mar  5 11:40:50 chadwick nslcd[11420]: [767df3] ldap_explode_dn(usi660) returned NULL: Success
> Mar  5 11:40:50 chadwick nslcd[11420]: [767df3] ldap_result() failed: Invalid DN syntax
> Mar  5 11:40:50 chadwick nslcd[11420]: [767df3] lookup of user usi660 failed: Invalid DN syntax
> Mar  5 11:41:46 chadwick kernel: ------------[ cut here ]------------
> Mar  5 11:41:46 chadwick kernel: WARNING: at lib/list_debug.c:26 __list_add+0x6d/0xa0() (Not tainted)
> Mar  5 11:41:46 chadwick kernel: Hardware name: PowerEdge R910
> Mar  5 11:41:46 chadwick kernel: list_add corruption. next->prev should be prev (ffff8820531518d0), but was ffff884d4c4594d0. (next=ffff884d4c4594d0).
> Mar  5 11:41:46 chadwick kernel: Modules linked in: gfs2 dlm configfs bridge autofs4 des_generic ecb md4 nls_utf8 cifs bnx2fc cnic uio fcoe libfcoe libfc 8021q garp stp llc ipv6 microcode power_meter iTCO_wdt iTCO_vendor_support dcdbas serio_raw ixgbe dca ptp pps_core mdio lpc_ich mfd_core sg ses enclosure i7core_edac edac_core bnx2 ext4 jbd2 mbcache dm_round_robin sr_mod cdrom sd_mod crc_t10dif qla2xxx scsi_transport_fc scsi_tgt pata_acpi ata_generic ata_piix megaraid_sas dm_multipath dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
> Mar  5 11:41:46 chadwick kernel: Pid: 74823, comm: vncserver Not tainted 2.6.32-431.3.1.el6.x86_64 #1
> Mar  5 11:41:46 chadwick kernel: Call Trace:
> Mar  5 11:41:46 chadwick kernel: [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
> Mar  5 11:41:46 chadwick kernel: [<ffffffff81071f16>] ? warn_slowpath_fmt+0x46/0x50
> Mar  5 11:41:46 chadwick kernel: [<ffffffff812944ed>] ? __list_add+0x6d/0xa0
> Mar  5 11:41:46 chadwick kernel: [<ffffffff811a6c02>] ? new_inode+0x72/0xb0
> Mar  5 11:41:46 chadwick kernel: [<ffffffffa03f45d5>] ? gfs2_create_inode+0x1b5/0x1150 [gfs2]
> Mar  5 11:41:46 chadwick kernel: [<ffffffffa03f3986>] ? gfs2_glock_nq_init+0x16/0x40 [gfs2]
> Mar  5 11:41:46 chadwick kernel: [<ffffffffa03ffc74>] ? gfs2_mkdir+0x24/0x30 [gfs2]
> Mar  5 11:41:46 chadwick kernel: [<ffffffff8122766f>] ? security_inode_mkdir+0x1f/0x30
> Mar  5 11:41:46 chadwick kernel: [<ffffffff81198149>] ? vfs_mkdir+0xd9/0x140
> Mar  5 11:41:46 chadwick kernel: [<ffffffff8119ab67>] ? sys_mkdirat+0xc7/0x1b0
> Mar  5 11:41:46 chadwick kernel: [<ffffffff8119ac68>] ? sys_mkdir+0x18/0x20
> Mar  5 11:41:46 chadwick kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
> Mar  5 11:41:46 chadwick kernel: ---[ end trace e51734a39976a028 ]---
> Mar  5 11:41:46 chadwick kernel: GFS2: fsid=analysis:lvol0.1: quota exceeded for user 101355
> Mar  5 11:41:47 chadwick abrtd: Directory 'oops-2014-03-05-11:41:47-12194-1' creation detected
> Mar  5 11:41:47 chadwick abrt-dump-oops: Reported 1 kernel oopses to Abrt
> Mar  5 11:41:47 chadwick abrtd: Can't open file '/var/spool/abrt/oops-2014-03-05-11:41:47-12194-1/uid': No such file or directory
> Mar  5 11:41:54 chadwick kernel: GFS2: fsid=analysis:lvol0.1: quota exceeded for user 101355

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster