
Re: More on high i/o load wedging Fedora 10



Eric Sandeen wrote:
Robin Laing wrote:

I will always report bugs if I can get the details. It is almost useless to report a bug without any details to post with it, as the first response is a request for more details.

Thanks, I understand that.

It takes time to learn what tools to use to find issues. I just read an IBM paper on tracing problems using iostat. I also found dstat at the same time. It is IO related as the problems all come from using or writing to a hard drive. It has also gotten worse and may be related to the latest kernel.

I understand, but in this thread I have repeatedly asked people hitting
this sort of hang to do "sysrq-w" (or, echo w > /proc/sysrq-trigger) -
nobody has ever shown me the results. [1]

I sympathize that it's hard to follow "this" thread, because it keeps
getting re-started under new subjects... :)


I now know what sysrq-w is and how to use it. I will get the data from the machine that I am having an issue with.

I am dumping EXT4 mostly because of reports I have read about data loss caused by the way it delays writes. I have already run into the issue of losing my KDE config files, as reported by others on the net.

http://www.advogato.org/person/mjg59/diary/195.html
http://www.h-online.com/open/Possible-data-loss-in-Ext4--/news/112821
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781?comments=all

Patches are in Fedora already to mitigate this, so it should not be a big
problem for you at this point.  If it is, I need to know about it.

I am also dumping EXT4 as I am trying to trace the issue with locking/freezing computer and I don't need data losses.

I understand; however, they may be related...


No. My last crash was without any EXT4. I was hoping EXT4 was the cause. It may be related, but at present it doesn't appear to be.


As it stands, I did get a kernel oops last night that didn't crash my system and was logged in messages. I was using a tty session so this could be why the system didn't totally freeze.

I have not had time to look through it and to see where it should be posted. It is related to USB as it occurred when I unplugged my USB drive that I was restoring data from. It was late and I was tired so I want to check things on the system before going further.

This may be something of a known issue, depending on the details.  (a
drive disappearing should not actually *oops* the box, but it will
probably spew lots of warnings and errors at least.)


Normal messages I am used to. I have had issues with USB sticks and such and the error messages that I saw were totally different than any I had seen before. The USB system wouldn't work after the error messages until a reboot. The USB drives or sticks wouldn't send any messages to the kernel. Nothing in dmesg or anything else.

There is an issue with filing kernel related bugs if the kernel is tainted because of Nvidia drivers. I have been told before that I need to remove the driver before filing a bug. Well that is hard to do when 3D is needed on the computer with the problem.

That's often true.  Speaking for myself, if there is some weird behavior
never-before reported, and the kernel exhibiting that behavior has
binary modules loaded, I often won't dig into it much because TBH I
can't debug it 100%, and the binary module is always suspect.  But if
the report correlates with other similar reports, it is still useful to
me, even with the binary module loaded.


This is nice to know.

I just tried the sysrq 'w' but I don't have that command on my machine at work.

[1] I probably should have been more explicit when I asked for this.

# echo w > /proc/sysrq-trigger
# dmesg > dmesg_output.txt

should work on any Fedora machine out of the box.
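
For anyone following along, here is a minimal sketch of the same capture step as a small shell script. The function names are made up for illustration. The assumptions beyond the two commands above are that it is run as root, and that the Alt-SysRq-w keyboard combination is gated by the value in /proc/sys/kernel/sysrq (1 enables everything, otherwise bit 0x8 covers the process dumps), whereas writing to /proc/sysrq-trigger as root is not gated by that value.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of any Fedora tool.

# Return 0 if the given kernel.sysrq value permits the Alt-SysRq-w
# keyboard combination: 1 enables all functions, otherwise bit 0x8
# enables the process/debugging dumps that 'w' belongs to.
sysrq_keyboard_w_enabled() {
    val=$1
    [ "$val" -eq 1 ] || [ $(( $val & 8 )) -ne 0 ]
}

# Trigger the blocked-task dump and save the kernel log to a file.
# Assumes root; the default file name matches the one used above.
capture_blocked_tasks() {
    out=${1:-dmesg_output.txt}
    echo w > /proc/sysrq-trigger || return 1
    dmesg > "$out"
}
```

Usage would be something like `capture_blocked_tasks /tmp/sysrq-w.txt` while the machine is stalled, and `sysrq_keyboard_w_enabled "$(cat /proc/sys/kernel/sysrq)"` to see whether the keyboard combination should be expected to work at all.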

Thanks,
-Eric



Thanks for this info. I will try it on my machine at home. I am off work tomorrow and will be looking at this.

As it stands, I had copied the error message to my stick.  Here it is.

=======================
usb 1-6: USB disconnect, address 4
BUG: unable to handle kernel NULL pointer dereference at 00000000
IP: [<c05231e1>] list_del+0x9/0x60
*pde = 3fa85067
Oops: 0000 [#1] SMP
Modules linked in: ext2 usb_storage fuse bridge stp bnep sco l2cap bluetooth asb100 hwmon_vid hwmon sunrpc ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 ext4 jbd2 crc16 dm_multipath raid1 uinput tuner_simple tuner_types tda9887 tda8290 msp3400 saa7127 saa7115 tuner snd_intel8x0 snd_ac97_codec ac97_bus snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mpu401 snd_mixer_oss nvidia(P) ppdev snd_mpu401_uart snd_rawmidi snd_pcm ivtv snd_timer floppy snd_seq_device snd videodev v4l1_compat compat_ioctl32 sata_sil skge soundcore i2c_algo_bit forcedeth cx2341x ns558 gameport pcspkr v4l2_common snd_page_alloc firewire_ohci tveeprom firewire_core crc_itu_t parport_pc i2c_nforce2 i2c_core parport usblp sha256_generic cbc aes_i586 aes_generic dm_crypt crypto_blkcipher ata_generic pata_acpi pata_amd [last unloaded: scsi_wait_scan]


Pid: 204, comm: khubd Tainted: P (2.6.27.19-170.2.35.fc10.i686 #1) A7N8X-E
EIP: 0060:[<c05231e1>] EFLAGS: 00010006 CPU: 0
EIP is at list_del+0x9/0x60
EAX: 00000000 EBX: ef567b74 ECX: 00000003 EDX: 00000303
ESI: 00000286 EDI: c07ebe14 EBP: f7876dd8 ESP: f7876dd4
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
Process khubd (pid: 204, ti=f7876000 task=f7912670 task.ti=f7876000)
Stack: ef567b74 f7876de4 c06aa0f3 ef56796c f7876df4 c044269f ef5678b0 ef56796c f7876e04 c0596851 ef5678b0 ef567904 f7876e18 c0595ff3 ef5678b0 ef567928 f1152e14 f7876e2c c0594e9c ef567800 ef5678b0 00000246 f7876e3c c05a6f25
Call Trace:
 [<c06aa0f3>] ? __up+0xe/0x20
 [<c044269f>] ? up+0x22/0x2f
 [<c0596851>] ? device_release_driver+0x22/0x26
 [<c0595ff3>] ? bus_remove_device+0x78/0x92
 [<c0594e9c>] ? device_del+0xe7/0x15d
 [<c05a6f25>] ? __scsi_remove_device+0x3c/0x6a
 [<c05a4ed2>] ? scsi_forget_host+0x30/0x4f
 [<c059f8c9>] ? scsi_remove_host+0x6a/0xdd
 [<f8e767f3>] ? quiesce_and_remove_host+0x56/0x99 [usb_storage]
 [<f8e768ed>] ? storage_disconnect+0x11/0x1b [usb_storage]
 [<c05d8288>] ? usb_unbind_interface+0x4e/0x9e
 [<c059677b>] ? __device_release_driver+0x70/0x8e
 [<c059684a>] ? device_release_driver+0x1b/0x26
 [<c0595ff3>] ? bus_remove_device+0x78/0x92
 [<c0594e9c>] ? device_del+0xe7/0x15d
 [<c05d6007>] ? usb_disable_device+0x63/0xc1
 [<c05d2532>] ? usb_disconnect+0x76/0x11b
 [<c05d31d6>] ? hub_thread+0x55a/0xe27
 [<c04036bf>] ? __switch_to+0xb9/0x139
 [<c043ef62>] ? autoremove_wake_function+0x0/0x33
 [<c0421ef0>] ? complete+0x34/0x3e
 [<c05d2c7c>] ? hub_thread+0x0/0xe27
 [<c043ecbf>] ? kthread+0x3b/0x61
 [<c043ec84>] ? kthread+0x0/0x61
 [<c040590b>] ? kernel_thread_helper+0x7/0x10
 =======================
Code: 53 08 8d 4b 04 8d 46 04 e8 75 00 00 00 8b 53 10 8d 4b 0c 8d 46 0c e8 67 00 00 00 5b 5e 5f 5d c3 90 90 55 89 e5 53 89 c3 8b 40 04 <8b> 00 39 d8 74 16 50 53 68 ea be 77 c0 6a 30 68 24 bf 77 c0 e8
EIP: [<c05231e1>] list_del+0x9/0x60 SS:ESP 0068:f7876dd4
---[ end trace 7117c3f4244a28bf ]---


=======================

Now some information.

This is an older AMD machine with 1 GB of RAM.

Two 1.5TB Seagate (updated driver) drives partitioned into 9 RAID 1 partitions using the motherboard controller. I updated the BIOS to get the drives recognized.

One 80Gig WD drive for swap and OS.

I am testing LUKS on the users' partitions, mounted with pam_mount. This seems to work pretty well, except that it doesn't umount the partitions on logout. That could be putting extra load on the system.

After doing some reading, I am finding some processes in D state with the combination of RAID and crypt on the system. The load does go up, but it isn't that high most of the time. During the file copies, it would go to the 3 to 4 range.

The system seemed stable until last week and didn't crash. I have yet to find the time to test an older kernel but will try it tomorrow.

One time, just before it crashed, I was watching top and the load shot up over 12, but the system seemed quite responsive. I have since learned that processes in D state can push the load reading high while the machine stays responsive, as they are calls to devices waiting for IO.
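
One way to see which tasks are sitting in D state when the load climbs is to filter the ps output on its state column. This is only a sketch: the function names are invented for illustration, and it assumes the procps version of ps with the stat, pid, wchan and comm output fields.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Keep only lines whose first column (the process state) starts with D,
# i.e. tasks in uninterruptible sleep, usually waiting on IO.
# The ps header line starts with "STAT", so it is dropped as well.
d_state_filter() {
    awk '$1 ~ /^D/ { print }'
}

# List state, PID, kernel wait channel and command name of D-state tasks.
list_d_state_tasks() {
    ps -eo stat,pid,wchan,comm | d_state_filter
}
```

Running `list_d_state_tasks` a few times during a large file copy should show whether the same tasks stay blocked, and the wait channel column gives a hint about where in the kernel they are waiting.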

I have had machines go to super heavy loads without crashing in the past so I think the kernel should be able to stay stable.

--
Robin Laing

