
ext3 performance issue with a Berkeley db application

Can someone suggest anything that will help with the following ext3
performance problem?  (It's a Berkeley db issue at bottom, but the ext3
part is worth looking at, I think.)

First, two paragraphs of background:  A Bayesian spam filter called
bogofilter uses Berkeley db to maintain two database files of identical
format: one containing words found in spam email and for each word the
number of times that word has been seen; the other is the same except
the words come from nonspam email. ("Words" is a bit of an
oversimplification but that doesn't matter for the present purpose.)
One of bogofilter's functions is to update one or the other of these
database files with the contents of a new message known to be spam or
nonspam; another is to look up the words found in an unclassified
message and determine by combination of probabilities whether the
message is more likely to be spam or nonspam.

The performance problem is this: as the spam database grows (somewhere
between 350,000 and a half million records), time taken to classify a
message, or to register a new spam, grows nonlinearly; there may be a
quantum jump somewhere in that range.  This is very noticeable on an
ext3 filesystem with data=ordered: to build a list of 170,000 tokens
starting from scratch takes 38 seconds on one of my computers, but to
build a list of 530,000 tokens starting from scratch takes 24 minutes
on the same box.
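
To put a number on "nonlinearly", here is a back-of-the-envelope
calculation from the two timings above.  Fitting a power law through
just two points is an assumption made for illustration only; it says
nothing about the real shape of the curve between them.

```python
import math

# Figures quoted above: tokens and build time in seconds, same machine.
small_tokens, small_secs = 170_000, 38
large_tokens, large_secs = 530_000, 24 * 60

size_ratio = large_tokens / small_tokens  # how much bigger the list is
time_ratio = large_secs / small_secs      # how much longer the build takes

# Illustrative assumption: time ~ tokens**k through these two points.
exponent = math.log(time_ratio) / math.log(size_ratio)

print(f"size ratio: {size_ratio:.1f}x")
print(f"time ratio: {time_ratio:.1f}x")
print(f"apparent exponent k: {exponent:.1f}")
```

Roughly three times the tokens costs nearly forty times the wall-clock
time, which is far steeper than linear growth would give.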

Here's the ext3 part.  On another machine I got the following times to
build that list of 530,000 tokens, starting by creating empty db files
on a partition mounted as:

ext2:                   3m 41s
ext3, data=ordered:    27m 59s
ext3, data=journal:    14m 11s
ext3, data=writeback:   4m  2s
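
For easier comparison, the times just quoted can be expressed as a
slowdown relative to ext2.  This is plain arithmetic on the figures
above; the variable and function names are only illustrative.

```python
# Build times quoted above for the 530,000-token list, as (minutes, seconds).
timings = {
    "ext2": (3, 41),
    "ext3, data=ordered": (27, 59),
    "ext3, data=journal": (14, 11),
    "ext3, data=writeback": (4, 2),
}


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) pair to total seconds."""
    return minutes * 60 + seconds


baseline = to_seconds(*timings["ext2"])

# Slowdown of each mount mode relative to the ext2 run.
slowdown = {mode: to_seconds(*t) / baseline for mode, t in timings.items()}

for mode, factor in slowdown.items():
    print(f"{mode:<22} {factor:4.1f}x ext2")
```

On these figures data=ordered is about 7.6 times slower than ext2,
data=journal about 3.9 times, and data=writeback within about 10%.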

I've discussed this with a senior bogofilter developer, Matthias
Andree, who says, "tell them we have many scattered pwrites and one
final fsync()."  You can see graphs of what's going on on his site:

1.  http://mandree.home.pages.de/bogofilter/writes.png --
the writes are pretty much unordered.

2.  which pages were accessed how frequently,
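
For anyone who wants to reproduce the access pattern without
bogofilter, here is a minimal sketch of "many scattered pwrites and
one final fsync()".  It is not bogofilter or Berkeley db code; the
function name, the 4096-byte page size, and the page and write counts
are all assumptions chosen for illustration, and the timings it prints
will depend heavily on the machine and mount options.

```python
import os
import random
import tempfile
import time

PAGE_SIZE = 4096  # assumed page size; Berkeley db's is configurable


def scattered_write_test(directory, n_pages=1024, n_writes=4096, seed=0):
    """Write pages at random offsets with pwrite, then fsync once.

    Returns (seconds spent in the pwrite loop, seconds spent in fsync).
    """
    rng = random.Random(seed)
    page = b"x" * PAGE_SIZE
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        # Extend the (sparse) file to its full size before writing.
        os.ftruncate(fd, n_pages * PAGE_SIZE)
        start = time.perf_counter()
        for _ in range(n_writes):
            # Pick a random page, so writes arrive in no particular order.
            os.pwrite(fd, page, rng.randrange(n_pages) * PAGE_SIZE)
        written = time.perf_counter()
        os.fsync(fd)  # the single final flush to disk
        synced = time.perf_counter()
    finally:
        os.close(fd)
        os.unlink(path)
    return written - start, synced - written


if __name__ == "__main__":
    # Replace this with a directory on the mount being tested.
    target = tempfile.gettempdir()
    write_secs, fsync_secs = scattered_write_test(target)
    print(f"pwrite loop: {write_secs:.3f}s, fsync: {fsync_secs:.3f}s")
```

Running it once per mount mode (ext2, then ext3 with each data= option)
should show whether the time goes into the writes themselves or into
the final flush.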

Some further findings:

o  It doesn't matter if the source (mbox) file is on ext3/ordered or
   ext2; the difference in time is insignificant.

o  Just processing a message to classify it takes about four times as
   long if the .db files are on ext3/ordered as it does if the .db
   files are on ext2.

o  Dumping the tokens and counts from the database in text form and
   reloading them into a new database file is not subject to serious
   performance problems; on the machine that needs 24 minutes to
   build from a 200-MB mbox, rebuilding the database from a list of
   tokens took eight seconds -- this was on ext3 in ordered mode.

These data were obtained on machines running Linux kernels
2.4.21-pre3-ac4 and -ac5 and 2.4.21-pre4-ac1; kernel 2.4.20-ac2 appears
to give similar results though this has not been thoroughly tested.
Results like those reported were initially obtained with db-3.1.17; the
tests shown here used db-4.1.25.

More info available on request; tuning hints most gratefully received
and tested.

| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg bgl nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |
