
[Linux-cluster] gnbd problem


I am running CentOS 4.4 with the cluster suite in a GNBD + GFS solution. The dual onboard NICs are bonded in alb mode on the client nodes, and the gnbd server has 4 NICs bonded in alb mode. The GNBD server has a 3ware 16-channel controller running RAID 6. The aggregate network throughput is great and so is the performance on GFS. This GFS installation is replacing my current Lustre installation.

Here is the problem. Under heavy load, like copying lots of very big files or running dd in a loop (from many HPC nodes simultaneously), I get the following error messages:

gnbd (pid 5296: cp) got signal 9
gnbd0: Send control failed (result -4)
gnbd0: Send data failed (result -104)
gnbd0: Receive control failed (result -32)
gnbd0: shutting down socket
exitting GNBD_DO_IT ioctl
resending requests
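
For reference, here is a small sketch that decodes the result values in those messages. It assumes the gnbd driver logs negated Linux errno values, which I have not confirmed from the gnbd source; the table only contains the three codes seen above, and the function name is just for illustration.

```python
# Linux errno values for the three results seen in the log above.
# Assumption: gnbd prints negated kernel errno codes (not verified in the source).
LINUX_ERRNO = {
    4: ("EINTR", "Interrupted system call"),
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

def describe(result):
    """Turn a logged result such as -104 into 'ECONNRESET: ...'."""
    name, text = LINUX_ERRNO.get(-result, ("UNKNOWN", "not in table"))
    return f"{name}: {text}"

# Messages and results copied from the log.
logged = [
    ("Send control failed", -4),
    ("Send data failed", -104),
    ("Receive control failed", -32),
]

for message, result in logged:
    print(f"{message} (result {result}) -> {describe(result)}")
```

If that reading is right, the first failure is an interrupted call, which would fit the signal 9 delivered to cp just before it, and the other two are the socket being torn down afterwards.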

One time with iozone, the GFS mount froze on all nodes. But once I disabled the oopses_ok mount option, that problem seems to have gone away. Hopefully, if there is an oops, that node will now panic and won't freeze GFS for the rest of the nodes. If I update gnbd from 1.0.8 to 1.0.9, will it fix the gnbd error messages under heavy load? The clients are using GNBD fencing. I am a bit concerned that the gnbd client, when re-opening its connection to the gnbd server, could cause data corruption or freeze the GFS mount. Is that a possibility? Since the other parts of the cluster suite are working fine and it is just the gnbd client that is having problems, the gnbd fencing probably won't kick in.

I have another related question. During a power outage, the HPC nodes would shut down within 5 minutes, and only the master node and storage server would keep running on the UPS with its battery pack for another hour or so. The master node re-exports the GFS mount via NFS to our infrastructure servers. In this scenario, when all the HPC nodes are down, the cluster loses quorum. Will the GFS mount on the master node freeze when the cluster loses quorum? If it does, is there a way around it, for example by giving the master node lots of votes? In Lustre, this scenario is possible: I can have a single server with mounted Lustre volumes still up while all the other nodes are down due to a power outage. Thanks very much.
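
To make the votes idea concrete, here is a rough sketch of the arithmetic. It assumes quorum is a simple majority of all expected votes (total // 2 + 1), which is my understanding and not something I have confirmed for this cluster suite version. The node count of 16 is hypothetical, and only the master node's votes are counted as surviving.

```python
def quorum(total_votes):
    # Assumed rule: strictly more than half of all expected votes.
    return total_votes // 2 + 1

def master_keeps_quorum(hpc_nodes, master_votes, votes_per_hpc_node=1):
    """True if the master node alone still meets quorum with all HPC nodes down."""
    total = hpc_nodes * votes_per_hpc_node + master_votes
    return master_votes >= quorum(total)

hpc_nodes = 16  # hypothetical count, not my real cluster size

for master_votes in (1, 16, 17):
    total = hpc_nodes + master_votes
    print(
        f"master votes={master_votes:2d} total={total:2d} "
        f"quorum={quorum(total):2d} "
        f"quorate alone={master_keeps_quorum(hpc_nodes, master_votes)}"
    )
```

Under that assumption, the master node would need at least one more vote than all the HPC nodes combined to stay quorate on its own.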

Balagopal Pillai
