[Linux-cachefs] Kernel BUG: CacheFiles: Error: Unexpected object collision
Mark Moseley
moseleymark at gmail.com
Tue May 18 16:45:00 UTC 2010
On Wed, May 12, 2010 at 2:28 PM, Mark Moseley <moseleymark at gmail.com> wrote:
> I've been running cachefilesd 0.10.1 since yesterday on this box and
> got this (attached) BUG traceback. System was unresponsive after that.
> Kernel is 2.6.33.3 with the suite of patches that David H put out the
> other day in the thread "Possible patch for CacheFiles: I/O Error:
> Unlink failed" (I actually applied the broken-out patches repackaged
> by Romain DEGEZ). Without the patches, cachefilesd dies after about 45
> minutes with the "Unlink failed" error. With the patches, it's run all
> the way since yesterday afternoon before dying a few minutes ago with
> this error (that I've not seen before). The system is a Dell Poweredge
> 1950, running Debian Lenny 32-bit, with a fairly NFS-intensive
> workload. I don't have the exact disk usage from right before it died
> but a 'df' approx 30 mins earlier showed that it had a little shy of 9
> gig used in the cache (with 58g free). I didn't do a df -i any time
> recently on it, so I don't know how many entries were in there but the
> vast majority is html and image files, so probably averaging in the
> 1-100k range, so quite a few entries.
>
> A few hours ago I happened to look at the /proc stats (but not since,
> so this is probably a few hours prior to the BUG):
>
> # cat /proc/fs/fscache/stats
> FS-Cache statistics
> Cookies: idx=5436 dat=876190 spc=0
> Objects: alc=796439 nal=0 avl=796402 ded=687802
> ChkAux : non=0 ok=457637 upd=0 obs=572
> Pages : mrk=3678901 unc=3250437
> Acquire: n=881626 nul=0 noc=0 ok=881626 nbf=0 oom=0
> Lookups: n=796793 neg=339151 pos=457288 crt=339151 tmo=354
> Updates: n=0 nul=0 run=0
> Relinqs: n=772550 nul=0 wcr=37 rtr=14494
> AttrChg: n=0 ok=0 nbf=0 oom=0 run=0
> Allocs : n=0 ok=0 wt=0 nbf=0 int=0
> Allocs : ops=0 owt=0 abt=0
> Retrvls: n=1008965 ok=541779 wt=229502 nod=376000 nbf=91186 int=0 oom=0
> Retrvls: ops=917779 owt=232622 abt=0
> Stores : n=1547630 ok=1547630 agn=0 nbf=0 oom=0
> Stores : ops=387729 run=1935352 pgs=1547623 rxd=1547630 olm=0
> VmScan : nos=3218178 gon=0 bsy=4 can=7
> Ops : pend=232917 run=1305508 enq=4102804 can=0 rej=0
> Ops : dfr=903 rel=1305508 gc=903
> CacheOp: alo=0 luo=0 luc=0 gro=0
> CacheOp: upo=0 dro=0 pto=0 atc=0 syn=0
> CacheOp: rap=0 ras=0 alp=0 als=0 wrp=0 ucp=0 dsp=0
>
>
> # cat /proc/fs/cachefiles/histogram
> JIFS SECS LOOKUPS MKDIRS CREATES
> ===== ===== ========= ========= =========
> 0 0.000 1694602 7529 336885
> 1 0.010 4126 50 1758
> 2 0.020 201 4 139
> 3 0.030 46 1 42
> 4 0.040 21 1 26
> 5 0.050 22 0 21
> 6 0.060 13 0 24
> 7 0.070 4 0 15
> 8 0.080 14 0 21
> 9 0.090 11 0 16
> 10 0.100 8 0 14
> 11 0.110 7 0 13
> 12 0.120 6 0 13
> 13 0.130 13 0 8
> 14 0.140 10 1 16
> 15 0.150 6 1 9
> 16 0.160 5 0 7
> 17 0.170 3 2 12
> 18 0.180 3 0 8
> 19 0.190 3 0 5
> 20 0.200 4 0 9
> 21 0.210 5 0 3
> 22 0.220 3 0 11
> 23 0.230 0 0 6
> 24 0.240 1 0 7
> 25 0.250 1 0 9
> 26 0.260 4 0 7
> 27 0.270 1 0 4
> 28 0.280 1 0 6
> 29 0.290 4 0 1
> 30 0.300 1 0 7
> 31 0.310 1 0 5
> 32 0.320 2 0 5
> 33 0.330 2 0 3
> 34 0.340 1 1 2
> 35 0.350 0 0 6
> 36 0.360 0 0 1
> 37 0.370 0 1 4
> 38 0.380 0 0 6
> 39 0.390 0 0 1
> 40 0.400 0 0 5
> 41 0.410 2 0 4
> 42 0.420 0 0 4
> 43 0.430 0 0 5
> 44 0.440 1 0 4
> 45 0.450 1 0 1
> 46 0.460 1 0 1
> 47 0.470 0 0 2
> 48 0.480 0 0 2
> 49 0.490 0 0 2
> 51 0.510 0 0 3
> 52 0.520 1 0 3
> 53 0.530 0 0 3
> 54 0.540 0 0 2
> 55 0.550 0 0 1
> 56 0.560 0 0 2
> 57 0.570 0 0 1
> 58 0.580 0 0 2
> 59 0.590 1 0 2
> 60 0.600 0 0 2
> 61 0.610 0 0 1
> 62 0.620 1 0 1
> 66 0.660 0 0 1
> 69 0.690 0 0 1
> 71 0.710 0 1 0
> 72 0.720 0 0 1
> 73 0.730 0 0 1
> 74 0.740 0 0 1
> 76 0.760 0 0 2
> 78 0.780 0 0 1
> 81 0.810 0 0 2
> 82 0.820 0 0 2
> 83 0.830 0 0 1
> 89 0.890 0 0 1
> 99 0.990 0 0 6
>
> I'll be happy to try anything out, either patch-wise or research-wise. thx
>
Has anybody else seen this error before? The comment in the code says:

    /* an old object from a previous incarnation is hogging the slot - we
     * need to wait for it to be destroyed */

If the object really is left over from a previous incarnation, does
that mean it's better to wipe the cache/ directory at each startup?
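
In case it helps anyone compare numbers from their own box, here is a rough
sketch for turning the /proc/fs/fscache/stats dump quoted above into
something easier to diff between runs. The function name
parse_fscache_stats and the SAMPLE string are my own illustration (SAMPLE
is just a few lines copied from the dump above with the "> " quoting
removed); it assumes the "Label : key=value ..." layout shown there and
nothing else about the kernel's format.

```python
import re
from collections import defaultdict

# A few lines copied from the stats dump above, with the "> " quoting removed.
SAMPLE = """\
FS-Cache statistics
Cookies: idx=5436 dat=876190 spc=0
Objects: alc=796439 nal=0 avl=796402 ded=687802
Lookups: n=796793 neg=339151 pos=457288 crt=339151 tmo=354
Retrvls: n=1008965 ok=541779 wt=229502 nod=376000 nbf=91186 int=0 oom=0
Retrvls: ops=917779 owt=232622 abt=0
"""


def parse_fscache_stats(text):
    """Parse 'Label : key=value ...' lines into {label: {key: int}}.

    Lines that share a label (e.g. the two 'Retrvls' lines) are merged
    into one dict. Lines without a colon, such as the title, are skipped.
    """
    stats = defaultdict(dict)
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # title line or blank line
        for key, value in re.findall(r"(\w+)=(\d+)", rest):
            stats[label.strip()][key] = int(value)
    return dict(stats)


if __name__ == "__main__":
    stats = parse_fscache_stats(SAMPLE)
    objects = stats["Objects"]
    lookups = stats["Lookups"]
    # Objects allocated but not yet marked dead at the time of the snapshot.
    print("objects alc - ded:", objects["alc"] - objects["ded"])
    # Share of lookups that found an existing cache object.
    print("positive lookup ratio: %.3f" % (lookups["pos"] / lookups["n"]))
```

To use it on a live system, read /proc/fs/fscache/stats into a string and
pass that in place of SAMPLE. Taking a snapshot every few minutes and
diffing the dicts would show which counters were moving before the BUG.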