plague: Job waited too long for repo to unlock. Killing it...

Michael Schwendt bugs.michael at gmx.net
Mon Dec 31 17:04:06 UTC 2007


On Mon, 31 Dec 2007 11:00:12 -0500, Dan Williams wrote:

> On Sun, 2007-12-30 at 17:54 +0100, Michael Schwendt wrote:
> > If in a failed job.log you see the message
> > 
> >     Job waited too long for repo to unlock. Killing it...
> > 
> > please notify me.
> > 
> > It's a problem in the plague server code that results in a denial of
> > service for subsequent build jobs. I have a traceback from Dec 28th, but
> > in the context of the source code it doesn't make sense yet (because a few
> > lines earlier the code ensures that the files to be copied exist and are
> > readable). Buildsys runs a slightly modified version that adds a bit more
> > debug output in this area.
> 
> Maybe just trap the exception, print it out, and continue?  That way at
> least the server doesn't fall over, it just fails to copy one item. 

The buildsys already runs such a patched Repo.py. It catches OSError and
IOError, releases the locks, and prints/logs the results of the file access
check that runs before the files are copied.
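For readers without access to the patched file, here is a minimal sketch of
that pattern. It is not the actual Repo.py code: the function name
copy_to_repo, the logger name, and the use of a threading.Lock are all
assumptions made for illustration.

```python
import logging
import os
import shutil
import threading

log = logging.getLogger("repo-sketch")  # hypothetical logger name


def copy_to_repo(src, dst, lock):
    """Copy src to dst while holding lock; return True on success.

    Hypothetical helper, not plague's real API. The lock is always
    released, so one failed copy cannot block subsequent jobs.
    """
    lock.acquire()
    try:
        # Log the result of the access check done before copying.
        readable = os.path.isfile(src) and os.access(src, os.R_OK)
        log.debug("access check for %s: readable=%s", src, readable)
        try:
            shutil.copy(src, dst)
        except (OSError, IOError) as exc:
            # Report the failure and carry on instead of falling over.
            log.error("copy %s -> %s failed: %s", src, dst, exc)
            return False
        return True
    finally:
        lock.release()
```

The try/finally is the important part: whatever happens during the copy, the
lock is released before the function returns.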

I also added a debug line in the package job code to see when it starts
deleting the copied files. Normally it waits until a callback tells it
that all files have been copied.

> It might also help debugging to see if only specific files can't be
> copied...

The offending file was copied, but shutil.copy() failed in its second step,
when it tries to copy the file mode from the source to the destination. It
didn't find the source file it had just copied. :-}
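To illustrate the failure described above: shutil.copy() copies the file
contents first and the permission bits second, and the second step has to
stat the source again. The sketch below mimics those two steps and uses a
hypothetical "between" hook to stand in for something removing the source
in the gap. Whether that is what actually happens in plague is not
established here; this only shows that the symptom is reproducible.

```python
import os
import shutil


def copy_in_two_steps(src, dst, between=None):
    """Mimic the two steps of shutil.copy().

    'between' is a hypothetical hook for this sketch only; it runs
    after the data copy and before the mode copy.
    """
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    shutil.copyfile(src, dst)   # step 1: copy the file contents
    if between is not None:
        between()               # e.g. simulate a concurrent delete
    shutil.copymode(src, dst)   # step 2: stats src, raises if it is gone
    return dst
```

Calling it with between=lambda: os.remove(src) leaves a complete copy at the
destination and still raises an OSError about the missing source, which
matches the traceback's odd appearance.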




More information about the epel-devel-list mailing list