Re: [Linux-cluster] samba on top of GFS
- From: Alan Wood <chekov ucla edu>
- To: "Christopher R. Hertel" <crh ubiqx mn org>
- Cc: linux clustering <linux-cluster redhat com>
- Subject: Re: [Linux-cluster] samba on top of GFS
- Date: Mon, 15 Nov 2004 23:59:49 -0800 (PST)
Thanks for your help guys. Sorry it has taken me a while to get back to
you. I have been trying to come up with intelligent things to add but have
been stymied by what appears to be an ever-changing target. Details below.
Here's what I think I know:
1. Samba does indeed crash when running with "oplocks = no" AND on a
single node of a two-node GFS cluster where the other node is not doing
much of anything. So, Christopher, the answer to your questions is that it
seems to be _purely_ samba and GFS interacting that sets up a crash.
2. Samba sometimes crashes in such a way that a "kill -9" will not
eliminate the crashed processes. In this situation attempting to start
samba again creates a [predictably] unstable situation in which kernel
oopses are evident.
3. Often samba simply starts to fail without outright crashing. This
renders both heartbeat monitoring and fencing rather useless. On the
client end this is evidenced by the ability to access a share and browse,
but severely degraded performance when accessing individual files and
frequent crashes of windows explorer. On the server end, more and more smb
processes start up but old ones don't die...
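For concreteness, the oplock-related knobs I have been toggling in smb.conf look roughly like this (a sketch only; the share name and path are placeholders, not my real config):

```ini
[global]
    # Disable opportunistic locks entirely. Clients lose their
    # client-side caching, which is probably why everything gets slow,
    # but it removes one whole class of Samba-internal lock state.
    oplocks = no
    level2 oplocks = no
    # Tying oplocks to the kernel does not obviously help here either,
    # since GFS coherency happens below Samba anyway.
    kernel oplocks = no

[testshare]
    path = /mnt/gfs/testshare   ; placeholder path
    read only = no
```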
Here are some observations. <disclaimer> Some of these could be completely
bogus because I am extrapolating from insufficient facts </disclaimer>.
1. Crashes are not necessarily caused by having multiple accesses to the
same file. In one situation having 7 computers read/write to the same
directory (but never the same file) appears to have caused a complete
crash of a server running samba on top of GFS.
2. Crashes could be load related. I seem to be exponentially more
likely to see a crash with 50 concurrent users than with 5. Since
having many users increases the chances of a situation like #1 above, I can
see a possible correlation, but this would not explain all crashes.
3. Larger files and more full directories experience severe performance
degradation in the samba/gfs scenario. Simply right-clicking on a
networked file that is a few hundred megabytes can take minutes to pop up
a menu (assuming it doesn't crash windows explorer first).
4. Crashes occur on quota-enforced and non-quota-enforced systems with no
discernible difference. However, access to files on quota-enforced systems might
I experimented with a lot of different settings in smb.conf, and while
the crashes were sometimes different I could not come up with a good
mapping of configs<=>crash causes. I have been very frustrated because
every time I think I've discovered something a new crash happens that
appears to prove me wrong. I need to create a separate test environment
where I can set up idealized crash conditions in order to give this list
some more credible data because my current environment has a lot of
simultaneous access from multiple users whose actions aren't easily
monitorable or consistent. For instance, when Bob can't access a server drive
on Windows computer one, he simply logs into computer two and then three
hoping for a different result. The server crashes, but did it crash because
Bob didn't wait long enough on the first access and escalated a fixable
problem into a crash? Hard to tell...
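One cheap way to make the "old processes don't die" symptom monitorable is to count smbd processes and flag any stuck in uninterruptible sleep (process state D), the state that even kill -9 cannot touch and that usually means the process is wedged inside the filesystem layer. A diagnostic sketch, assuming a Linux procps `ps` (this is hypothetical tooling, not something from my setup):

```shell
#!/bin/sh
# Count smbd processes, and how many are in state "D" (uninterruptible
# sleep). D-state processes ignore kill -9 and typically indicate a
# process blocked inside the filesystem layer (here, GFS).
total=$(ps axo comm= | grep -c '^smbd')
stuck=$(ps axo stat=,comm= | awk '$1 ~ /^D/ && $2 == "smbd"' | wc -l)
echo "smbd: total=$total stuck_in_D=$stuck"
```

Run from cron every minute, counts like these would at least give heartbeat something concrete to alarm on instead of waiting for an outright crash.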
hoping someone out there has seen something similar or can shed some light.
my system details:
I'm still running samba 3.0.7 on top of kernel 2.6.8-1.521
I tried updating to the newest CVS releases but ran into compile errors and
haven't had time to try again, so the GFS build is still from mid-September.
I'd like to try the fixes that Patrick and David posted but think I am
going to try to compile cleanly with a 2.6.9 kernel.
I tried turning the log level up to 3. I get the following fairly often:
Nov 15 17:10:32 clu2 smbd: Error writing 5 bytes to client. -1. (Connection reset by peer)
Nov 15 17:10:32 clu2 smbd: [2004/11/15 17:10:32, 0] lib/util_sock.c:send_smb(647)
Nov 15 17:10:32 clu2 smbd: write_socket: Error writing 5 bytes to socket 24: ERRNO = Connection reset by peer
Nov 15 17:10:32 clu2 smbd: [2004/11/15 17:10:32, 0] lib/util_sock.c:write_socket(455)
Nov 15 17:10:32 clu2 smbd: write_socket_data: write failure. Error = Connection reset by peer
Nov 15 17:10:32 clu2 smbd: [2004/11/15 17:10:32, 0] lib/util_sock.c:write_socket_data(430)
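For the record, the logging portion of my smb.conf is roughly the following (paths illustrative, not my exact config):

```ini
[global]
    # log level = 3 is what produces the send_smb/write_socket
    # errors above; per-client log files keep the noise separable.
    log level = 3
    log file = /var/log/samba/log.%m
    max log size = 1000
```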
otherwise I just see runaway processes that won't die or I get a fence
event with no apparent log entry leading to it.
On Mon, 1 Nov 2004, Christopher R. Hertel wrote:
On Mon, Nov 01, 2004 at 12:30:47PM -0800, Alan Wood wrote:
I am running a cluster with GFS-formatted file systems mounted on multiple
nodes. What I was hoping to do was to set up one node running httpd to be
my webserver and another node running samba to share the same data
What I am getting when running that is instability.
Yeah. This is a known problem. The reason is that Samba must maintain a
great deal of metadata internally. This works well enough with multiple
Samba processes running on a single machine dealing (more or less)
directly with the filesystem.
The problem is that Samba must keep track of translations between Posix
and Windows metadata, locking semantics, file sharing mode semantics, etc.
I had assumed that this would only be a problem if Samba was running on
multiple machines all GFS-sharing the same back-end block storage. Your
report suggests that there's more to the interaction between Samba and GFS
than I had anticipated. Interesting...
The samba serving node
keeps crashing. I have heartbeat set up so that failover happens to the
webserver node, at which point the system apparently behaves well.
Which kind of failover? Do you start Samba on the webserver node? It
would be interesting to know if the two run well together on the same
node, but fail on separate nodes.
After reading a few articles on the list it seemed to me that the problem
might be samba using oplocks or some other caching mechanism that breaks
Yeah... that was my next question...
I tried turning oplocks=off in my smb.conf file, but that
made the system unusably slow (over 3 minutes to right-click on a two-meg file).
...but did it fix the other problems?
I'd really love to work with someone to figure all this out. (Hint hint.)
I am also not sure that is the extent of the problem, as I seem to be able
to re-create the crash simply by accessing the same file on multiple
clients just via samba (which locking should be able to handle).
If the problem were merely that the remote node and the samba node were both
accessing an oplocked file I could understand, but that doesn't always seem
to be the case.
There's more here than I can figure out just from the description. It'd
take some digging alongside someone who knows GFS.
has anyone had any success running the same type of setup? I am also
serving nfs on the samba server, though with very little load there.
Is there any overlap in the files they're serving?
below is the syslog output of a crash. I'm running 2.6.8-1.521smp with a
GFS CVS dump from mid-september.
Wish I could be more help...