4KSTACKS et al...

Paul A Houle ph18 at cornell.edu
Tue Aug 2 17:10:08 UTC 2005


    A few weeks ago we had a 4-way amd64 web server running RHEL 4 that 
crashed sporadically -- nothing left in the syslog.  up2date didn't find 
a new kernel,  so I just downloaded and installed the latest kernel from 
kernel.org and the system has been stable ever since.  I'm not sure if I 
could have gone to RH for support because Cornell has a site license,  
and even if I had a direct line to RH management,  it would take me more 
time to explain the problem than it would take to try a mainstream kernel.

    Overall,  I'm quite happy with the four-digit revision mainstream 
Linux kernels.  We had a crash on our main machine that left a stack 
trace;  I did some research on the web,  found that the bug had been 
fixed in 2.6.11.something,  upgraded the kernel,  case closed.

    People are willing to pay $$ to get an "enterprise" product which is 
reliable,  and supported,  but this is another case where the generic 
product turns out to be more reliable than the branded product,  and 
looking at what's happening with Fedora,  I've got a lot of concern that 
RH's pursuit of innovation will always lead to a kernel long on gee-whiz 
features and short on reliability.  Crashes mean I get calls from the 
NOC at 4am,  and god forbid that my toddler hears the phone ring or me 
walking down the stairs,  because I'll need to entertain him while 
dealing with the crash and for the rest of the morning.  Then a week 
later I go to Netcraft and it says my uptime is seven days,  and I feel 
like a jerk because the whole world knows about my problems.

    I think there are two reasons for the RHEL 4 instability:  (i) the 
quarterly release cycle means that I have to wait for bug fixes -- and 
if you're running a non-x86 architecture,  it seems like 2.6 is shaking 
out bugs at a high rate,  and (ii) RH is aggressively pushing new features.

    I really don't know what's in RHEL 4 (it would take me more time to 
look at the patches than it would to revert to mainstream) but the 
activation of 4KSTACKS in Fedora is one of those changes that reduces 
reliability.

    I've been looking,  and I've never found out what benefit 
4KSTACKS has for end users.  The kernel team is sensible,  so I'm sure 
that there are some real benefits,  but looking at the problem reports 
and at the attitudes of some people on this list,  I start to wonder if 
it's just a vindictive attempt to put an end to ndiswrapper.  (I'd 
really love to see an explanation of the benefits of 4KSTACKS.)

    The real trouble is that 4KSTACKS problems aren't in kernel modules 
per se,  but really are in the combination of modules that are running.  
Yeah,  maybe they can get reiserfs running under 4KSTACKS,  but what if 
you're running an NFSv4 server with all the whizzy options turned on,  
and IPv6 with tunneling and it's a reiserfs filesystem and you're using 
LVM and RAID and a particularly funky SCSI driver,  what then?
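
    To make the "combination" point concrete,  here is a rough sketch in 
Python.  Every per-layer byte count in it is made up for illustration 
(these are not measurements of any real kernel),  and the layer names and 
helper functions are hypothetical.  The only thing it is meant to show is 
that each layer can look fine by itself while the whole call chain,  added 
up,  goes past 4K.

```python
# Illustrative only: the byte counts below are invented, not measured.
STACK_4K = 4096
STACK_8K = 8192


def stack_depth(call_chain):
    """Total stack bytes used when every layer in the chain is active."""
    return sum(bytes_used for _layer, bytes_used in call_chain)


def fits(call_chain, stack_size):
    """True if the whole nested call chain fits in the given stack size."""
    return stack_depth(call_chain) <= stack_size


# A plain local write: few layers stacked on top of each other.
simple_path = [
    ("syscall entry", 300),
    ("VFS", 500),
    ("filesystem", 900),
    ("block layer", 400),
    ("SCSI driver", 600),
]

# The same filesystem, but reached through NFS and sitting on LVM + RAID.
stacked_path = [
    ("nfsd", 700),
    ("VFS", 500),
    ("filesystem", 900),
    ("LVM", 500),
    ("RAID", 600),
    ("block layer", 400),
    ("SCSI driver", 600),
]

for name, chain in (("simple", simple_path), ("stacked", stacked_path)):
    print(name, stack_depth(chain),
          "fits 4K:", fits(chain, STACK_4K),
          "fits 8K:", fits(chain, STACK_8K))
```

    No single entry in the stacked chain is anywhere near 4096 bytes,  
which is why testing one module at a time would not catch the overflow.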

    By adopting 4KSTACKS early,  Fedora has helped shake out problems 
with 4KSTACKS,  but when 4KSTACKS becomes the main option in the 
mainstream kernel,  we'll see people dealing with weird problems that 
happen sporadically on certain setups for years to come.  We seem to 
have one of the worst workloads in the world,  and the last thing I need 
is more crashes.




More information about the fedora-devel-list mailing list