
RE: [Linux-cluster] Problem with GFS2 - Kernel Panic - Can NOT erase directory



Hello again,

Wondering why only one service fails, I tried to narrow down the root cause.

I found that the files in only one directory (where the failing service keeps its files) are corrupted.

 

Running ls -l in the directory gives the following output:

 

ls: reading directory .: Input/output error
total 192
?--------- ? ?     ?          ?            ? account_boinc.bakerlab.org_rosetta.xml
?--------- ? ?     ?          ?            ? account_climateprediction.net.xml
?--------- ? ?     ?          ?            ? account_predictor.chem.lsa.umich.edu.xml
?--------- ? ?     ?          ?            ? all_projects_list.xml
-rw-r--r-- 1 boinc boinc 159796 Jun 22 22:47 client_state_prev.xml
?--------- ? ?     ?          ?            ? client_state.xml
-rw-r--r-- 1 boinc boinc   5141 Jun 13 23:21 get_current_version.xml
?--------- ? ?     ?          ?            ? get_project_config.xml
-rw-r--r-- 1 boinc boinc    899 Apr  4 17:06 global_prefs.xml
?--------- ? ?     ?          ?            ? gui_rpc_auth.cfg
?--------- ? ?     ?          ?            ? job_log_boinc.bakerlab.org_rosetta.txt
?--------- ? ?     ?          ?            ? job_log_predictor.chem.lsa.umich.edu.txt
?--------- ? ?     ?          ?            ? lockfile
?--------- ? ?     ?          ?            ? lookup_account.xml
?--------- ? ?     ?          ?            ? lookup_website.html
?--------- ? ?     ?          ?            ? master_boinc.bakerlab.org_rosetta.xml
?--------- ? ?     ?          ?            ? master_climateprediction.net.xml
?--------- ? ?     ?          ?            ? master_predictor.chem.lsa.umich.edu.xml
?--------- ? ?     ?          ?            ? projects
?--------- ? ?     ?          ?            ? sched_reply_boinc.bakerlab.org_rosetta.xml
?--------- ? ?     ?          ?            ? sched_reply_climateprediction.net.xml
?--------- ? ?     ?          ?            ? sched_reply_predictor.chem.lsa.umich.edu.xml
?--------- ? ?     ?          ?            ? sched_request_boinc.bakerlab.org_rosetta.xml
-rw-r--r-- 1 boinc boinc   6766 Jun 22 21:27 sched_request_climateprediction.net.xml
?--------- ? ?     ?          ?            ? sched_request_predictor.chem.lsa.umich.edu.xml
?--------- ? ?     ?          ?            ? slots
?--------- ? ?     ?          ?            ? statistics_boinc.bakerlab.org_rosetta.xml
?--------- ? ?     ?          ?            ? statistics_climateprediction.net.xml
?--------- ? ?     ?          ?            ? statistics_predictor.chem.lsa.umich.edu.xml
?--------- ? ?     ?          ?            ? stderrdae.txt
?--------- ? ?     ?          ?            ? stdoutdae.txt
?--------- ? ?     ?          ?            ? time_stats_log
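
The "?" fields above mean that stat() fails on those entries; a quick check on one of them confirms it (run inside the same directory; the exact error text below is my assumption, from memory):

    stat client_state.xml
    # fails with something like:
    # stat: cannot stat `client_state.xml': Input/output error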

 

At the same moment the kernel reports what follows below (also attached to my previous e-mail).

 

Trying to rm -rf the directory fails with the same kernel message.

 

Any ideas on how to erase the problematic directory?

Also, the other node (the one on which I do not perform any actions on the file system in question) gives the following message:

 

 

GFS2: fsid=tweety:gfs2-00.0: jid=1: Trying to acquire journal lock...
GFS2: fsid=tweety:gfs2-00.0: jid=1: Busy

 

And the file system becomes permanently inaccessible on that node. Does anyone know why that is?
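
If it helps, I can post the state of the mount group as seen by the cluster tools (a sketch of what I would run; cman 2.0.x on both nodes):

    # shows the fence, dlm and gfs groups and whether one is stuck in recovery
    cman_tool services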

 

Thank you all for your time

T. Kontogiannis

 

 

From: linux-cluster-bounces@redhat.com [mailto:linux-cluster-bounces@redhat.com] On Behalf Of Theophanis Kontogiannis
Sent: Monday, June 30, 2008 5:52 PM
To: 'linux clustering'
Subject: [Linux-cluster] Problem with GFS2 - Kernel Panic

 

Hello all,

 

I have a two-node cluster with DRBD running in Primary/Primary.

Both nodes are running:

 

- Kernel 2.6.18-92.1.6.el5.centos.plus
- GFS2 fsck 0.1.44
- cman_tool 2.0.84
- Cluster LVM daemon version: 2.02.32-RHEL5 (2008-03-04), protocol version: 0.2.1
- DRBD version: 8.2.6 (api:88)
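
For the record, these versions were collected roughly as follows (a sketch from memory; I assume the stock CentOS 5 tool names):

    uname -r             # kernel
    gfs2_fsck -V         # GFS2 fsck
    cman_tool version    # cman
    clvmd -V             # cluster LVM daemon and protocol versions
    cat /proc/drbd       # DRBD version and api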

 

 

After a corruption (the result of updating and rebooting with the FS mounted, a network interruption during the reboot, and similar issues), I keep getting the following on one node:

 

Jun 30 00:13:40 tweety1 clurgmgrd[5283]: <notice> stop on script "BOINC" returned 1 (generic error)
Jun 30 00:13:40 tweety1 clurgmgrd[5283]: <info> Services Initialized
Jun 30 00:13:40 tweety1 clurgmgrd[5283]: <info> State change: Local UP
Jun 30 00:13:45 tweety1 clurgmgrd[5283]: <notice> Starting stopped service service:BOINC-t1
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: fatal: invalid metadata block
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0:   bh = 21879736 (magic number)
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 332
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: about to withdraw this file system
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: telling LM to withdraw
Jun 30 00:13:46 tweety1 clurgmgrd[5283]: <notice> Service service:BOINC-t1 started
Jun 30 00:13:46 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: withdrawn
Jun 30 00:13:46 tweety1 kernel:
Jun 30 00:13:46 tweety1 kernel: Call Trace:
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff88629146>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff800639de>] __wait_on_bit+0x60/0x6e
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff80014eec>] sync_buffer+0x0/0x3f
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff80063a58>] out_of_line_wait_on_bit+0x6c/0x78
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff8009d1bb>] wake_bit_function+0x0/0x23
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff8863af7f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff8862ca06>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff8862795a>] :gfs2:gfs2_inode_refresh+0x22/0x2ca
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff8009d1bb>] wake_bit_function+0x0/0x23
Jun 30 00:13:46 tweety1 kernel:  [<ffffffff88626d9c>] :gfs2:inode_go_lock+0x29/0x57
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff88625f04>] :gfs2:glock_wait_internal+0x1d4/0x23f
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff8862611d>] :gfs2:gfs2_glock_nq+0x1ae/0x1d4
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff88632053>] :gfs2:gfs2_lookup+0x58/0xa7
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff8863204b>] :gfs2:gfs2_lookup+0x50/0xa7
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff80022663>] d_alloc+0x174/0x1a9
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff8000cbb4>] do_lookup+0xd3/0x1d4
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff80009f73>] __link_path_walk+0xa01/0xf42
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff8861fd37>] :gfs2:compare_dents+0x0/0x57
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff8000e782>] link_path_walk+0x5c/0xe5
Jun 30 00:13:47 tweety1 kernel:  [<ffffffff88624d6f>] :gfs2:gfs2_glock_put+0x26/0x133

 

 

After that, the machine freezes completely. The only way to recover is to power-cycle / reset.

 

"gfs2_fsck -vy /dev/mapper/vg0-data0" ends (it does not abort; it just looks like it finishes normally) with:

 

Pass5 complete
Writing changes to disk
gfs2_fsck: buffer still held for block: 21875415 (0x14dcad7)
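
For reference, this is how I ran the check (a sketch; "/data0" stands in for my actual mount point):

    # unmount on BOTH nodes first - fsck must not run on a mounted GFS2
    umount /data0
    # then run the check from one node only
    gfs2_fsck -vy /dev/mapper/vg0-data0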

 

After remounting the file system and starting a service that keeps its files on this gfs2 filesystem, the kernel crashes again with the same message and the node freezes up.

 

Unfortunately, due to bad handling, I failed to invalidate the problematic node in DRBD, which would have made it the sync target (theoretically solving the problem, since the good node would then have synced the bad one).

Instead I made the bad node the sync source, and now both nodes have the same issue :-(
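
For the archives, what I should have done instead (a sketch; "r0" stands in for my actual resource name):

    # on the node with the BAD data - marks it outdated and makes it the SyncTarget
    drbdadm invalidate r0
    # or, equivalently, from the node with the GOOD data:
    # drbdadm invalidate-remote r0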

 

 

Any ideas on how I can resolve this issue?

 

Sincerely,

 

Theophanis Kontogiannis

 

