
Re: [K12OSN] Server Help! (a little desperate)



Shawn Powers wrote:

Wow...

Things have been going great this year, our entire district is using thin clients. Here's a very brief breakdown of how things are running:

1 Server handles DNS, TFTP, DHCP, NIS
1 Server handles NFS (/home), SMB
1 Server handles LTSP (running 4.0.1, but the TFTP and DHCP are farmed out to the other server)


For some reason, I've had 2 major "glitches" this year.

Last week, eth0 (where clients connect) just quit responding. The server appeared fine, but 10.10.10.10 was not pingable. After a brief panic, I just ran ifdown eth0, and ifup eth0 -- and I've had no problems until today. They started right after I left for lunch, of course.

Today, the LTSP server quit responding altogether. When going to the console, I couldn't even get THAT to come up. I power cycled the machine, and everything has come up just peachy -- BUT I'm very worried now.

I'm getting some "I told you so's" from the staff, who say that putting all my eggs in one basket was a bad idea, that with Linux you get what you pay for, etc, etc, etc...

My question? Where do I start looking for the problem? I've read just about every bit of text in /var/log -- and nothing looks fishy. At 13:00, messages just stopped being written to /var/log/messages. There were no odd entries before it stopped.

Are there other logs I should be checking? Perhaps after school today, I'll take the server down and run memtest... Especially during this first year, I need close to 100% uptime, and I've had bad luck so far.



That's one of those things... all is fine and suddenly... BOOM!

I have seen this (too) many times before. A couple of tips:
Run a badblocks check on the hard disk(s).
Change your logging. If it is the disk or controller that is going bad, the system can't write to its logs to report the failure...
You can send your syslogs to another server (the NFS one, for example).
Use a resource monitoring tool to keep an eye on processor and memory usage.
Where were the network load, processor, and memory (incl. swap) in the seconds before the system went down?
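
To make the last two tips concrete, here is a rough Python sketch (not something running on my systems, just an illustration) that samples load average, free memory, and swap use from /proc and sends each sample to a remote syslog host over UDP, so the last readings survive even if the local disk stops accepting writes. The names parse_meminfo, format_sample, and monitor, the logger name "ltsp-monitor", and the 10-second interval are all made up for this example. It also assumes the receiving syslog daemon is configured to accept remote messages.

```python
import logging
import logging.handlers
import time


def parse_meminfo(text):
    """Return {field: value in kB} from the contents of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def format_sample(loadavg_text, meminfo_text):
    """Build one log line from /proc/loadavg and /proc/meminfo contents."""
    load1, load5, load15 = loadavg_text.split()[:3]
    mem = parse_meminfo(meminfo_text)
    swap_used = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    return "load=%s/%s/%s memfree_kb=%d swapused_kb=%d" % (
        load1, load5, load15, mem.get("MemFree", 0), swap_used)


def read_file(path):
    """Read a small text file such as /proc/loadavg."""
    with open(path) as handle:
        return handle.read()


def monitor(loghost, interval=10, port=514):
    """Send one sample to the remote syslog host every `interval` seconds.

    `loghost` is whatever machine collects the logs (hypothetical here).
    Runs until interrupted.
    """
    logger = logging.getLogger("ltsp-monitor")
    logger.setLevel(logging.INFO)
    logger.addHandler(
        logging.handlers.SysLogHandler(address=(loghost, port)))
    while True:
        logger.info(format_sample(read_file("/proc/loadavg"),
                                  read_file("/proc/meminfo")))
        time.sleep(interval)
```

Calling monitor("nfs-server") (hostname assumed) from a small script started at boot would leave a trail of samples on the other machine. After the next lockup, the last few lines there show whether load or swap was climbing beforehand.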


As I said, I have seen this before; in most cases it was either the hard disk or the controller that went (slowly) bad.
In one case it was the NIC; bad NICs can start 'sending' random bits over the wire.
If nobody is connected to the LTSP server, is the NIC still 'sending' stuff, or is it all quiet?
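
One rough way to answer that question is to read the interface counters in /proc/net/dev twice while no clients are connected and compare them. The Python sketch below does that; parse_net_dev, counter_delta, and watch are names invented for this example, and "eth0" is taken from the original post. Keep in mind that the counters only show what the driver knows about, so a NIC that is failing in hardware may be better caught from the switch or from another host on the same segment.

```python
import time


def parse_net_dev(text, iface):
    """Extract packet and error counters for `iface` from /proc/net/dev text."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = [int(value) for value in rest.split()]
            # Field order: 8 receive columns, then 8 transmit columns.
            return {
                "rx_packets": fields[1],
                "rx_errs": fields[2],
                "tx_packets": fields[9],
                "tx_errs": fields[10],
            }
    raise ValueError("interface %r not found" % iface)


def counter_delta(before, after):
    """Return how much each counter grew between two samples."""
    return {key: after[key] - before[key] for key in before}


def watch(iface="eth0", seconds=10):
    """Sample the counters twice, `seconds` apart, and return the growth."""
    with open("/proc/net/dev") as handle:
        before = parse_net_dev(handle.read(), iface)
    time.sleep(seconds)
    with open("/proc/net/dev") as handle:
        after = parse_net_dev(handle.read(), iface)
    return counter_delta(before, after)
```

If watch("eth0") on an idle server shows tx_packets or the error counters growing quickly, the card deserves a closer look.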


Good luck

Peter

--

Any technology distinguishable from
foodoo-magic is insufficiently advanced.


Peter Van den Wildenbergh


CriticalControl Solutions Inc.
Bow Valley Square II
Suite 2400
205 - 5th avenue SW
Calgary, AB T2P 2V7

T 403.705.7500
F 403.705.7555


