udev performance

Wed Sep 12 08:48:35 UTC 2007

On Wed, Sep 12, 2007 at 09:29:45AM +0200, Harald Hoyer wrote:
> Jakub Jelinek schrieb:
> >BTW, each read_depends below will scan the whole modules.dep (if 
> >unsuccessful)
> >or first part thereof.  In the modprobes that take most of the time, is
> >modules.dep read just once or multiple times?
> 
> Hmm, don't know exactly what you mean... attached are the times.
> 
> modalias;real;user;sys
> 
> First modprobe takes the longest, as it fills the cache.

E.g. 
strace /sbin/modprobe pci:v0000104Cd0000803Bsv0000104Dsd000081EFbc01sc80i00
shows
open("/lib/modules/2.6.22.1-27.fc7/modules.dep", O_RDONLY) = 3
+ ~ 65 x 4KB read + parsing it in userspace
open("/lib/modules/2.6.22.1-27.fc7/modules.alias", O_RDONLY) = 3
+ ~ 65 x 4KB read + parsing it in userspace
open("/lib/modules/2.6.22.1-27.fc7/modules.dep", O_RDONLY) = 3
+ ~ 40 x 4KB read + parsing it in userspace

While both the big files are probably cached already, even the reads from
cache and especially parsing can mean significant time spent in it.

I have ran 50 times
/sbin/modprobe pci:v0000104Cd0000803Bsv0000104Dsd000081EFbc01sc80i00
under oprofile, 37.9% is spent in the kernel which I didn't have
debuginfo for.  But then we have:
0000003a628672a0 7031     24.6745  libc-2.6.so              fgetc
0000000000402370 2974     10.4369  modprobe                 getline_wrapped
0000003a62877580 1299      4.5587  libc-2.6.so              strpbrk
0000000000401310 871       3.0567  modprobe                 .plt
0000000000402d70 808       2.8356  modprobe                 underscores
0000003a628778c0 567       1.9898  libc-2.6.so              strspn
0000003a62876a20 492       1.7266  libc-2.6.so              strchr
0000003a6286e360 441       1.5476  libc-2.6.so              malloc_consolidate
0000003a628705b0 396       1.3897  libc-2.6.so              _int_malloc

static int fgetc_wrapped(FILE *file, unsigned int *linenum)
{
        for (;;) {
                int ch = fgetc(file);
                if (ch != '\\')
                        return ch;
                ch = fgetc(file);
                if (ch != '\n')
                        return ch;
                if (linenum)
                        (*linenum)++;
        }
}

static char *getline_wrapped(FILE *file, unsigned int *linenum)
{
        int size = 1024;
        int i = 0;
        char *buf = NOFAIL(malloc(size));
        for(;;) {
                int ch = fgetc_wrapped(file, linenum);
                if (i == size) {
                        size *= 2;
                        buf = NOFAIL(realloc(buf, size));
                }
                if (ch < 0 && i == 0) {
                        free(buf);
                        return NULL;
                }
                if (ch < 0 || ch == '\n') {
                        if (linenum)
                                (*linenum)++;
                        buf[i] = '\0';
                        return NOFAIL(realloc(buf, i+1));
                }
                buf[i++] = ch;
        }
}

Even just replacing fgetc with fgetc_unlocked should help tremendously,
fgetc needs to lock the stream, so 2 atomic instructions per character
plus it is a library call.  fgetc_unlocked is inlined, needs no locking
(modprobe isn't threaded, right?) and if there is anything buffered
already (4095 out of 4096 times) it is just dereferencing a few pointers.
The malloc/realloc/free per each line in the file (in my case
1486 malloc/frees times 2 for modules.dep and 5872 for modules.alias
can be also avoided, from what I see all users of
getline_wrapped always free what it returned and it is never called
recursively.  Which means there is no reason to malloc/free this every time.
If instead this does:
	static int size = 0;
	static char *buf = NULL;
...
		if (i == size) {
			size = size ? size * 2 : 1024;
			buf = NOFAIL(realloc(buf, size));
		}
...
			return buf; /* No second realloc.  */
and never frees what getline_wrapped returned, you can always use the
same buffer.  Still, it is processing a significant amount of data,
so while the above two proposed simple changes could give 10% or so,
they won't kill the cost altogether.  There are other inefficiencies
all over, like:
        len = strlen(modname);                                                                                                   
        if (strchr(modname, '.'))                                                                                                
                len = strchr(modname, '.') - modname;
which gcc is at least ATM not able to optimize (while the duplicate
strchr is called just once, gcc will call strlen unconditionally, even when
it should be only called if strchr failed - though best is just
use len = strchrnul(modname, '.') - modname;
here.

With a prepared binary blob which would include a hash table for easy
finding of dependencies for a module we could just open that, and only
once.  If the binary blob starts with a magic number and then has
st_ino/st_dev/st_ctime of both modules.dep and modules.aliases in the
same dir, then you can easily keep the text files for handediting
purposes and only use the binary blob if the text files haven't
changed since the binary blob was recreated, and let the depmod or
whatever creates the binary files also create the binary blob and
have a mode in which it creates the blob from the text files rather
than from whatever it found on disk.
modules.dep searches just for the basenames of modules without extension
and dir component, and ignores differences between - and _.  So, just create
a hash function which handles - and _ the same and gives good results
on the usual module names and stick a hash table (e.g. with chains)
after the file header, then the chains would contain the actual module
name's basename for module_equal comparison, full 32-bit hash and pointer
to where the full strings are found in the string table (which should probably be
compressed using a directory table).  module.aliases is harder, because
it contains wildcards, so hash table isn't useful, but some kind of
tree, where each node would contain a length, sorted prefixes of that
length plus leaf nodes that contain wildcards at that point already and
thus need to be tested all by fnmatch.  Searching would just in each node
test all the wildcardish leaf nodes, if none matched binary search
for what matches the prefix and recurse into that node.
E.g. root node could have length of minimum of smallest alias name
and smallest number of chars before first wildcard in any alias name
(unless that is zero, otherwise the alias names starting with
wildcard would go into the set needed to go through fnmatch at that level).

	Jakub