On 05/25/2011 11:29 PM, Christophe Varoqui wrote:
on a 'multipathd reconfigure' command, the uxclient gets stuck and the multipathd daemon strace shows:

$ sudo strace -f -p 17721
Process 17721 attached with 7 threads - interrupt to quit
[pid 17757] futex(0x7fdc6a1540a4, FUTEX_WAIT_PRIVATE, 3, NULL <unfinished ...>
[pid 17756] futex(0x11167f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 17755] ioctl(3, DM_DEV_WAIT<unfinished ...>
[pid 17724] futex(0x11167f0, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 17723] recvmsg(6,<unfinished ...>
[pid 17722] futex(0x110a1b4, FUTEX_WAIT_PRIVATE, 15, NULL <unfinished ...>
[pid 17721] futex(0x612624, FUTEX_WAIT_PRIVATE, 1, NULL

ok, I tracked it down to 9e7b4d8d6fa8dc9433c1e60d4bd6717aec2f5296

Here you add acquire/release of the vector lock inside multipathd/main.c:reconfigure(), but as seen in the following LCKDBG trace, the lock is already acquired in multipathd/main.c:uxsock_trigger()

Hence the lock -> lock = hang.

I committed and pushed a partial revert of 9e7b4d8d6fa8dc9433c1e60d4bd6717aec2f5296

But maybe you'd rather see us stop acquiring the lock in uxsock_trigger() and instead acquire it more selectively in the functions called from there ...

Please comment.
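The lock -> lock hang described above can be reproduced in isolation. What follows is a minimal sketch, not the multipathd code: vecs_lock, vecs_lock_init(), reconfigure_locking(), reconfigure_caller_locked() and uxsock_trigger_sketch() are hypothetical stand-ins. It assumes an error-checking POSIX mutex, so that relocking from the owning thread returns EDEADLK instead of blocking the way a default mutex typically does.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for the daemon's vector lock. */
static pthread_mutex_t vecs_lock;

/* Assumption: an error-checking mutex, so a relock by the owning thread
 * returns EDEADLK instead of blocking forever. */
static void vecs_lock_init(void)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&vecs_lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
}

/* Pattern that hangs: the callee takes the lock itself. */
static int reconfigure_locking(void)
{
    int rc = pthread_mutex_lock(&vecs_lock);
    if (rc != 0)
        return rc; /* EDEADLK when this thread already holds the lock */
    /* ... rebuild configuration ... */
    pthread_mutex_unlock(&vecs_lock);
    return 0;
}

/* Pattern without the inner lock: the callee relies on the caller's lock. */
static int reconfigure_caller_locked(void)
{
    /* ... rebuild configuration ... */
    return 0;
}

/* Stand-in for the CLI dispatch path, which holds the lock around the handler. */
static int uxsock_trigger_sketch(int (*handler)(void))
{
    int rc = pthread_mutex_lock(&vecs_lock);
    if (rc != 0)
        return rc;
    rc = handler();
    pthread_mutex_unlock(&vecs_lock);
    return rc;
}
```

After vecs_lock_init(), calling uxsock_trigger_sketch(reconfigure_locking) reports EDEADLK, while uxsock_trigger_sketch(reconfigure_caller_locked) returns 0. The alternative raised above, dropping the lock from the dispatch path and taking it in each handler, would correspond to calling reconfigure_locking() directly.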
Hmm. Yes, your fix appears to be correct.

I had several locking issues during startup (calling cli commands while the daemon is still starting up is a nice way of testing it), and made several (unsuccessful) attempts at fixing them. The real cause was missing locking during initial configuration, so it looks as if 9e7b4d8d6fa8dc9433c1e60d4bd6717aec2f5296 was in fact a left-over from the earlier attempts.

So yeah, your patch seems to be fine. Will be doing more testing here.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                   zSeries & Storage
hare suse de                          +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)