Well, I thought this would be an interesting topic to explore because I'm working in a team most sysadmins probably dream of—we don't work over holidays, nights, or weekends. Our mode of operation dictates that everyone works eight hours a day from Monday to Friday. Office hours are between 6:45 am and 8:00 pm, but most of us keep to the typical workday schedule of 8:00 am to 5:00 pm with an hour for lunch.
That means there is no operations team present onsite or even on call during nights, weekends, and holidays. (Full disclosure: we do put in an extra hour here and there, but only for some of the bigger migrations, e.g., when a new SAN goes live or we migrate a whole mail system, including mailboxes and calendars. Some guys will briefly check in at night or during the weekend to see if the jobs are still running. But, as a rule, other typical maintenance tasks like release upgrades take place during the usual office hours.)
Sounds awesome, right? Unfortunately, we're not hiring at the moment.
When I tell people that we don't have to work on weekends and holidays and don't have to be on call at night, they usually ask, "But what do you do when something breaks?"
So we work 8x5, but our clients need our services to be available 24x7. To make that possible, you have to consider a few important things when it comes to product selection and service design.
Hardware will fail
Yes, even yours. It's only a matter of time. It could be a power outage, flooding, theft, fire, or other destruction of your data center site, or even simple hardware failure, which happens to everyone. There are many reasons that could and will lead to system failure, so when you design a new service, be sure to factor in how to keep that service up and running if an entire data center becomes unavailable. Here are some examples of how to achieve that.
Location, location, location
If your data center is in a single location and that location becomes unavailable for whatever reason, every service hosted there goes down with it. To guard against that, you might consider stretching your data center over two locations. That's what we've done.
Our virtualization platform consists of several stretched clusters. For example, if a cluster has eight hypervisor hosts, four are placed in Location 1 and the other four in Location 2. We use at most 50% of the overall cluster resources, so each location can take over the full load of the other. If we lose one location, the high availability mechanisms kick in and recover the lost virtual machines in the second location. You might have a brief downtime here, and therefore a service impact, but the platform will recover automatically without human intervention.
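The 50% rule can be checked with a few lines of arithmetic. The following is only a minimal sketch in Python, not the tooling we actually run: the `Site` class, the capacity unit (GB of RAM per host), and all numbers are hypothetical and chosen just to mirror the eight-host example.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    hosts: int
    capacity_per_host: float  # hypothetical unit, e.g. GB of RAM per host
    used: float               # resources consumed by VMs running at this site


def survives_site_loss(sites: list[Site]) -> bool:
    """Return True if all VMs still fit after any single site fails."""
    total_used = sum(s.used for s in sites)
    for failed in sites:
        # Capacity left over when this one site is gone
        remaining = sum(
            s.hosts * s.capacity_per_host for s in sites if s is not failed
        )
        if total_used > remaining:
            return False
    return True


# Eight hosts split four and four (all numbers are made up)
loc1 = Site("Location 1", hosts=4, capacity_per_host=512, used=1000)
loc2 = Site("Location 2", hosts=4, capacity_per_host=512, used=1000)
print(survives_site_loss([loc1, loc2]))
```

With these made-up numbers the cluster uses 2000 of 4096 GB, just under half, so either location can absorb the other. Push usage above 50% and the check fails.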
Often, there is more than one virtual machine (VM) delivering a service. In the case that there are two or more virtual machines for one service, we distribute them to the different locations. So in the event that one site becomes unavailable and one VM has to be restarted in the second location, the service remains available on the other VM.
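To illustrate the placement idea, here is a small Python sketch that spreads the VMs of one service over the locations in round-robin order and then checks whether the service survives the loss of a site. The function names and VM names are hypothetical. In practice, a virtualization platform would typically enforce this through its own placement or anti-affinity rules rather than a script.

```python
from itertools import cycle


def spread_across_sites(vms: list[str], sites: list[str]) -> dict[str, str]:
    """Assign the VMs of one service to sites in round-robin order."""
    site_cycle = cycle(sites)
    return {vm: next(site_cycle) for vm in vms}


def service_survives(placement: dict[str, str], failed_site: str) -> bool:
    """True if at least one VM of the service runs outside the failed site."""
    return any(site != failed_site for site in placement.values())


# Hypothetical service with two VMs and two locations
placement = spread_across_sites(["web-1", "web-2"], ["Location 1", "Location 2"])
print(placement)
print(service_survives(placement, "Location 1"))
```

If both VMs ended up in the same location, `service_survives` would return False for that location, which is exactly the situation the distribution is meant to avoid.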
To make this possible, we backed our virtualization platform with a SAN with synchronous block replication. So if we lose half of our virtualization platform and half of our storage subsystems, we are still able to operate with no or only minimal service impact.
Two are better than one
If there is only one firewall device, or one load balancer, or only one switch, this device becomes a single point of failure. When it fails, all connected devices and services will fail as well (see Fig. 2).
So we try to cluster almost everything and stretch it over both of our locations. This way, we could lose a whole site with minimal service impact. The logical architecture remains the same as in Fig. 2, but the physical layout looks more like Fig. 3.
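One way to spot remaining single points of failure is to walk through an inventory and flag every device role that is not present in both locations. The sketch below assumes a hypothetical inventory given as simple (role, site) pairs. It is an illustration of the rule, not a description of our actual inventory system.

```python
from collections import defaultdict


def find_single_points_of_failure(
    devices: list[tuple[str, str]], sites: list[str]
) -> list[str]:
    """Return the device roles that are missing from at least one site."""
    sites_by_role: dict[str, set[str]] = defaultdict(set)
    for role, site in devices:
        sites_by_role[role].add(site)
    required = set(sites)
    # A role is safe only if it has an instance in every required site
    return sorted(
        role for role, present in sites_by_role.items() if not required <= present
    )


# Hypothetical inventory: the load balancer exists in one location only
inventory = [
    ("firewall", "Location 1"),
    ("firewall", "Location 2"),
    ("load balancer", "Location 1"),
    ("switch", "Location 1"),
    ("switch", "Location 2"),
]
print(find_single_points_of_failure(inventory, ["Location 1", "Location 2"]))
```

In this made-up inventory, only the load balancer is flagged, because it has no counterpart in Location 2.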
If some device or application is not able to operate in stretched cluster mode, we try to avoid using it at all. When a person or department insists on buying such an application, they may have to agree to a longer downtime, because if it breaks on Friday evening, it will not be fixed before Monday morning.
What comes with it?
Usually, an organization has to spend big on overtime pay to get work done on holidays, nights, and weekends, and you would need 24x7 service contracts to get help from a vendor during these times.
Here's the upside. If you don't work during those unfortunate times, your boss doesn't have to pay for it, and you might be fine with service contracts that cover only 8x5 (local office hours), which saves a lot of money.
Of course, for us, there is a special period during the year when we have downtime built in: the semester break. When most students are gone, there is not as much load on our systems, so we use this time to upgrade, patch, and renew our systems with minimal impact on our users.
I know I'm lucky as a sysadmin to have normal working hours. I have to say, I love this job for a lot of reasons, but the awesome work/life balance that comes with it is definitely one of the biggest.