More on high i/o load wedging Fedora 10

Thu Mar 19 20:33:54 UTC 2009

Eric Sandeen wrote:
> Robin Laing wrote:
> 
>> I always will report bugs if I can get the details.  It is almost 
>> useless to report bugs if you don't have any details to post with it as 
>> there is a request for more details.
> 
> Thanks, I understand that.
> 
>> It takes time to learn what tools to use to find issues.  I just read an 
>> IBM paper on tracing problems using iostat.  I also found dstat at the 
>> same time.  It is IO related as the problems all come from using or 
>> writing to a hard drive.  It has also gotten worse and may be related to 
>> the latest kernel.
> 
> I understand, but in this thread I have repeatedly asked people hitting
> this sort of hang to do "sysrq-w" (or, echo w > /proc/sysrq-trigger) -
> nobody has ever shown me the results. [1]
> 
> I sympathize that it's hard to follow "this" thread, because it keeps
> getting re-started under new subjects... :)
> 

I now know what sysrq-w is and how to use it.  I will get the data from 
the machine that I am having an issue with.

>> My dumping EXT4 is more due to reports that I have read about data loss 
>> due to the procedure for write delays.  I have run into the issue of 
>> losing my kde config files as reported by others on the net already.
>>
>> http://www.advogato.org/person/mjg59/diary/195.html
>> http://www.h-online.com/open/Possible-data-loss-in-Ext4--/news/112821
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781?comments=all
> 
> patches are in Fedora already to mitigate this, it should not be a big
> problem for you at this point.  If it is, I need to know about it.
> 
>> I am also dumping EXT4 as I am trying to trace the issue with 
>> locking/freezing computer and I don't need data losses.
> 
> I understand; however, they may be related...
> 

No.  My last crash happened without any EXT4 in use.  I was hoping EXT4 
was the cause.  It may be a factor, but at present it doesn't appear to be.

>> As it stands, I did get a kernel oops last night that didn't crash my 
>> system and was logged in messages.  I was using a tty session so this 
>> could be why the system didn't totally freeze.
>>
>> I have not had time to look through it and to see where it should be 
>> posted.  It is related to USB as it occurred when I unplugged my USB 
>> drive that I was restoring data from.  It was late and I was tired so I 
>> want to check things on the system before going further.
> 
> This may be something of a known issue, depending on the details.  (a
> drive disappearing should not actually *oops* the box, but it will
> probably spew lots of warnings and errors at least.)
> 

I am used to the normal messages.  I have had issues with USB sticks and 
such, but the error messages I saw this time were totally different from 
any I had seen before.  The USB system wouldn't work after the error 
messages until a reboot.  The USB drives or sticks wouldn't send any 
messages to the kernel.  Nothing in dmesg or anywhere else.

>> There is an issue with filing kernel related bugs if the kernel is 
>> tainted because of Nvidia drivers.  I have been told before that I need 
>> to remove the driver before filing a bug.  Well that is hard to do when 
>> 3D is needed on the computer with the problem.
> 
> That's often true.  Speaking for myself, if there is some weird behavior
> never-before reported, and the kernel exhibiting that behavior has
> binary modules loaded, I often won't dig into it much because TBH I
> can't debug it 100%, and the binary module is always suspect.  But if
> the report correlates with other similar reports, it is still useful to
> me, even with the binary module loaded.
> 

This is nice to know.

>> I just tried the sysrq 'w' but I don't have that command on my machine 
>> at work.
> 
> [1] I probably should have been more explicit when I asked for this.
> 
> # echo w > /proc/sysrq-trigger
> # dmesg > dmesg_output.txt
> 
> should work on any fedora machine out of the box.
> 
> Thanks,
> -Eric
> 
> 

Thanks for this info.  I will try it on my machine at home.  I am off 
work tomorrow and will be looking at this.
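
For anyone else following along, here is a minimal sketch of how the two 
commands Eric gave could be wrapped so that only the blocked-task dump ends 
up in the file.  The function names are made up for illustration, and the 
marker text the filter looks for ("SysRq : Show Blocked State") is an 
assumption about what the kernel prints before the dump; if your kernel 
words it differently, plain "dmesg > dmesg_output.txt" as Eric wrote still 
works.  It has to be run as root.

```shell
# Hypothetical helper: keep only the text from the last
# "Show Blocked State" marker onward (assumed marker wording).
last_blocked_dump() {
    awk '
        /[Ss]ys[Rr]q ?: Show Blocked State/ { buf = ""; seen = 1 }
        seen { buf = buf $0 "\n" }
        END  { printf "%s", buf }
    '
}

# Hypothetical wrapper around the two commands from Eric's mail.
# Usage (as root): capture_blocked_tasks [output-file]
capture_blocked_tasks() {
    out=${1:-dmesg_output.txt}
    # Ask the kernel to dump all blocked (D state) tasks to its log.
    echo w > /proc/sysrq-trigger || return 1
    # Save the part of the kernel log that follows the marker.
    dmesg | last_blocked_dump > "$out"
}
```

Running "capture_blocked_tasks" while the machine is wedged should leave 
the dump in dmesg_output.txt, ready to attach to a bug report.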

As it stands, I had copied the error message to my stick.  Here it is.

=======================
usb 1-6: USB disconnect, address 4
BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c05231e1>] list_del+0x9/0x60
*pde = 3fa85067
Oops: 0000 [#1] SMP
Modules linked in: ext2 usb_storage fuse bridge stp bnep sco l2cap 
bluetooth asb100 hwmon_vid hwmon sunrpc ip6t_REJECT nf_conntrack_ipv6 
ip6table_filter ip6_tables ipv6 ext4 jbd2 crc16 dm_multipath raid1 
uinput tuner_simple tuner_types tda9887 tda8290 msp3400 saa7127 saa7115 
tuner snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss 
snd_seq_midi_event snd_seq snd_pcm_oss snd_mpu401snd_mixer_oss nvidia(P) 
ppdev snd_mpu401_uart snd_rawmidi snd_pcm ivtv snd_timer floppy 
snd_seq_device snd videodev v4l1_compat compat_ioctl32 sata_sil skge 
soundcore i2c_algo_bit forcedeth cx2341x ns558 gameport pcspkr 
v4l2_common snd_page_alloc firewire_ohci tveeprom firewire_core 
crc_itu_t parport_pc i2c_nforce2 i2c_core parport usblp sha256_generic 
cbc aes_i586 aes_generic dm_crypt crypto_blkcipher ata_generic pata_acpi 
pata_amd [last unloaded: scsi_wait_scan]

Pid: 204, comm: khubd Tainted: P          (2.6.27.19-170.2.35.fc10.i686 #1) A7N8X-E
EIP: 0060:[<c05231e1>] EFLAGS: 00010006 CPU: 0
EIP is at list_del+0x9/0x60
EAX: 00000000 EBX: ef567b74 ECX: 00000003 EDX: 00000303
ESI: 00000286 EDI: c07ebe14 EBP: f7876dd8 ESP: f7876dd4
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process khubd (pid: 204, ti=f7876000 task=f7912670 task.ti=f7876000)
Stack: ef567b74 f7876de4 c06aa0f3 ef56796c f7876df4 c044269f ef5678b0 ef56796c
       f7876e04 c0596851 ef5678b0 ef567904 f7876e18 c0595ff3 ef5678b0 ef567928
       f1152e14 f7876e2c c0594e9c ef567800 ef5678b0 00000246 f7876e3c c05a6f25
Call Trace:
  [<c06aa0f3>] ? __up+0xe/0x20
  [<c044269f>] ? up+0x22/0x2f
  [<c0596851>] ? device_release_driver+0x22/0x26
  [<c0595ff3>] ? bus_remove_device+0x78/0x92
  [<c0594e9c>] ? device_del+0xe7/0x15d
  [<c05a6f25>] ? __scsi_remove_device+0x3c/0x6a
  [<c05a4ed2>] ? scsi_forget_host+0x30/0x4f
  [<c059f8c9>] ? scsi_remove_host+0x6a/0xdd
  [<f8e767f3>] ? quiesce_and_remove_host+0x56/0x99 [usb_storage]
  [<f8e768ed>] ? storage_disconnect+0x11/0x1b [usb_storage]
  [<c05d8288>] ? usb_unbind_interface+0x4e/0x9e
  [<c059677b>] ? __device_release_driver+0x70/0x8e
  [<c059684a>] ? device_release_driver+0x1b/0x26
  [<c0595ff3>] ? bus_remove_device+0x78/0x92
  [<c0594e9c>] ? device_del+0xe7/0x15d
  [<c05d6007>] ? usb_disable_device+0x63/0xc1
  [<c05d2532>] ? usb_disconnect+0x76/0x11b
  [<c05d31d6>] ? hub_thread+0x55a/0xe27
  [<c04036bf>] ? __switch_to+0xb9/0x139
  [<c043ef62>] ? autoremove_wake_function+0x0/0x33
  [<c0421ef0>] ? complete+0x34/0x3e
  [<c05d2c7c>] ? hub_thread+0x0/0xe27
  [<c043ecbf>] ? kthread+0x3b/0x61
  [<c043ec84>] ? kthread+0x0/0x61
  [<c040590b>] ? kernel_thread_helper+0x7/0x10
  =======================
Code: 53 08 8d 4b 04 8d 46 04 e8 75 00 00 00 8b 53 10 8d 4b 0c 8d 46 0c e8 67 00 00 00 5b 5e 5f 5d c3 90 90 55 89 e5 53 89 c3 8b 40 04 <8b> 00 39 d8 74 16 50 53 68 ea be 77 c0 6a 30 68 24 bf 77 c0 e8
EIP: [<c05231e1>] list_del+0x9/0x60 SS:ESP 0068:f7876dd4
---[ end trace 7117c3f4244a28bf ]---

=======================

Now some information.

This is an older AMD machine.  1Gig ram.

Two 1.5TB Seagate (updated driver) drives partitioned into 9 RAID 1 
partitions using the motherboard controller.  Updated the BIOS to get the 
drives recognized.

One 80Gig WD drive for swap and OS.

I am testing luks on the users' partitions mounted with pam_mount.  This 
seems to work pretty well.  It just doesn't umount the partitions on 
logout.  That could be putting extra load on the system.

After doing some reading, I am seeing some processes in D state with the 
combination of RAID and crypt on the system.  The load does go up but 
isn't that high most of the time.  During the file copies, it would go 
to the 3 to 4 range.

The system seemed stable until last week and hadn't crashed before then.  
I have yet to find the time to test an older kernel but will try it tomorrow.

One time, just before it crashed, I was watching top and the load shot 
up over 12 but the system seemed quite responsive.  I have since learned 
that processes in D state (uninterruptible sleep, usually waiting on 
device IO) count toward the load average, so the load can read high while 
the machine is still responsive.
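
To check whether a load spike like that is made up of D-state processes, 
something along these lines could be run while the load is high.  This is 
only a sketch: the function names are my own, and it assumes the procps 
"ps" that Fedora ships, where "state" prints a one-letter state column 
under a header line.

```shell
# Hypothetical filter: keep only rows whose first column (the process
# state) starts with D, skipping the header line that ps prints.
filter_dstate() {
    awk 'NR > 1 && $1 ~ /^D/ { print }'
}

# List processes currently in uninterruptible sleep (state D).
list_dstate() {
    ps -eo state,pid,comm | filter_dstate
}

# Show the load averages next to the D-state list so the two can be
# compared at the same moment.
show_load_and_dstate() {
    cat /proc/loadavg
    list_dstate
}
```

If "show_load_and_dstate" prints a high first number from /proc/loadavg 
together with a long list of D-state processes, the load is mostly tasks 
waiting on IO rather than tasks competing for the CPU.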

I have had machines go to super heavy loads without crashing in the past 
so I think the kernel should be able to stay stable.

--
Robin Laing