Linux housekeeping: Handling archives and backups
Every sysadmin knows, or should know, that performing and managing backups is an essential part of being a system administrator. If you read 5 Linux backup and restore tips from the trenches, you know how to perform and manage backups. But managing the space required to perform those backups is a very different topic. This is part of Linux housekeeping that you need to consider in your daily workflow.
In this article, I touch on the critical pieces of backup management: location, retention, disposal and disposition, and automation. Your environment and policies will dictate the solutions you implement for backup space management, but these guidelines and recommendations will help if you're struggling to handle a growing amount of idle data stored on your network.
Location, location, location
Everyone has heard the old real estate saying that location is the most important aspect of a property. Storage space is similar: where data lives matters a great deal. The three storage locations to consider are:
- High-speed disks—Initial and short-term backup storage
- Network-attached storage—Public drives and shared spaces
- Offsite storage and cloud-based storage—long-term storage
High-speed disk storage includes locally attached storage, flash arrays, and SAN storage. This location is where backups initially land, but they should not stay there long-term. These types of storage are far too expensive to hold backups for any length of time. Move data to lower-cost storage immediately after each backup completes.
You should also separate backups from production data. In other words, don't collect backups onto the same disks where you have databases, transaction logs, or other write-intensive services storing data.
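As a minimal sketch of moving a finished backup set off the fast tier, the Python below copies a file to cheaper storage, verifies the copy with a checksum, and only then frees the original. The function names and any paths you pass in are hypothetical, and it assumes each backup set is a single file; treat it as an illustration rather than a finished tool.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def relocate_backup(source: Path, dest_dir: Path) -> Path:
    """Copy a backup set to lower-cost storage, verify it, then remove the original."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps where it can
    if sha256_of(source) != sha256_of(target):
        target.unlink()  # discard the bad copy and keep the original in place
        raise OSError(f"checksum mismatch while copying {source}")
    source.unlink()  # free the expensive disk only after verification
    return target
```

Verifying before deleting costs an extra read of both files, but it means a failed transfer never leaves you without a good copy.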
Network-attached storage is generally for data-in-use applications such as public drives and shared spaces. These areas are not for backups; backup sets are generally too large to store in public spaces. Public drives and shared spaces also need to be backed up, which is another reason for not storing backup sets in these areas.
Separate network-attached backup storage can be provisioned to store data for long-term access and use. In the 3-2-1 rule (three copies of your data, on two different types of media, with one copy offsite), this would be one of the copies stored on different media. Once data has been transferred to network-attached storage, it can be replicated to a separate array, to archival media, or to some type of offsite storage.
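To make the 3-2-1 rule concrete, here is a small sketch that checks an inventory of copies against it, reading the rule as at least three copies, on at least two media types, with at least one copy offsite. The `BackupCopy` class and its field names are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # free-form label, e.g. "production SAN"
    media: str     # media type, e.g. "san", "nas", "object-storage"
    offsite: bool  # True if the copy lives outside the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check for >= 3 copies, >= 2 media types, and >= 1 offsite copy."""
    media_types = {copy.media for copy in copies}
    has_offsite = any(copy.offsite for copy in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite


inventory = [
    BackupCopy("production SAN", "san", offsite=False),
    BackupCopy("backup NAS", "nas", offsite=False),
    BackupCopy("cloud bucket", "object-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))
```

The production data itself counts as one of the three copies in this sketch.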
Offsite storage holds data that will remain unchanged and will not be accessed by users or administrators. It is static and is stored for disaster recovery purposes. Some administrators back up from network-attached storage to some form of portable disk storage for offsite delivery and storage.
Alternatively, administrators can transfer data from network-attached storage to cloud-based or hosted storage for disaster recovery purposes.
Retention
Of all the housekeeping problems facing system administrators, data retention is the most controversial and the most painstaking, and it is always great fodder for debate. Retention refers to the amount of time that you keep backups in case of a disaster or the need for a full restore. While these events are rare, you still need a contingency plan for them.
My suggestions are as follows:
- Critical data—0 days to 6 months
- User data—7 to 30 days
- Transactional data—3 days
- Legacy data—Permanent on non-attached disks
Backups require a lot of storage space, and the longer you retain data, the more space is needed. Retention is good, but overdoing it wastes resources. Unaccessed data more than 30 days old should be archived where it's still accessible if needed. At some point, you have to feel comfortable about throwing away old, outdated data.
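The suggested retention periods above can be expressed as a small lookup table. The sketch below uses the upper bound of each range, approximates six months as 183 days, and treats legacy data as never expiring; those choices, and all the names in the code, are my assumptions for illustration, so substitute the values from your own policy.

```python
from datetime import datetime, timedelta
from typing import Optional

# Maximum retention in days per data class; None means "keep permanently".
RETENTION_DAYS: dict[str, Optional[int]] = {
    "critical": 183,       # assumption: six months taken as 183 days
    "user": 30,
    "transactional": 3,
    "legacy": None,
}


def is_expired(data_class: str, backup_time: datetime, now: datetime) -> bool:
    """Return True if a backup has outlived the retention period for its class."""
    limit = RETENTION_DAYS[data_class]
    if limit is None:
        return False
    return now - backup_time > timedelta(days=limit)


now = datetime(2024, 6, 1)
print(is_expired("transactional", datetime(2024, 5, 20), now))  # 12 days old
print(is_expired("user", datetime(2024, 5, 20), now))
```

Keeping the policy in one table makes it easy to review with the people who own the data before anything is deleted.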
Disposal and disposition
I don't know about you, but destroying data, even with permission or a directive to do so, always makes me feel a little uncomfortable. Dismantling drives used for storage makes me feel wasteful. Shredding tapes is perhaps the worst of all. Maybe I'm a hoarder at heart, but I can't help it. My gut tells me to protect all data. This obsession with keeping too much data is almost as bad as an obsession with destroying data.
Disposal and disposition are not the same, although they share the same end goal: removing data, and possibly hardware, from your network.
Disposal is the removal of data and the hardware it resided on without regard for the eventual consequences of that disposal. That is to say that this type of removal doesn't necessarily involve the safe removal of data or hardware. Sometimes data isn't erased at all, but the hardware itself is thrown away or recycled. This careless removal can result in security breaches and environmental impact.
Disposition is the responsible removal of data and hardware. It is usually governed by policy. For data, it means irreversible destruction by means that make the data impossible to retrieve. For hardware, disposition refers to responsible recycling, recommissioning, or return to the manufacturer, perhaps as part of an exchange or an upgrade program.
Automation
I have successfully set up fully automated backup systems for multiple clients. My automation usually consists of automated backups, automated movement of backup sets to more permanent storage, and automated archiving of data that no one has accessed in more than six months.
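As one possible way to find archive candidates, the sketch below walks a directory tree and lists files whose last access time is more than roughly six months old (taken here as 183 days). The function name and threshold are hypothetical. It also assumes access times are trustworthy, which may not hold on filesystems mounted with `noatime`, so check your mount options before relying on it.

```python
import os
import time
from pathlib import Path
from typing import Optional

SIX_MONTHS_SECONDS = 183 * 24 * 60 * 60  # assumption: six months as 183 days


def stale_files(
    root: Path,
    max_idle_seconds: float = SIX_MONTHS_SECONDS,
    now: Optional[float] = None,
) -> list[Path]:
    """Return files under root not accessed within max_idle_seconds, sorted by path."""
    now = time.time() if now is None else now
    stale = []
    for path in root.rglob("*"):
        # st_atime is the last access time in seconds since the epoch
        if path.is_file() and now - path.stat().st_atime > max_idle_seconds:
            stale.append(path)
    return sorted(stale)
```

The function only reports candidates. Moving or compressing them is left as a separate, deliberate step.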
The "trick" to automating backups, or any complex set of tasks, is timing. You have to wait until you finish your first task before beginning the second, and so on. This is especially difficult with backups because the duration of a backup is unpredictable. Rather than waiting a specific amount of time for a backup or other task to finish, I set up a process check. If my process is still running, the next task doesn't begin. This method ensures that you never miss a backup set or fail to move a set to alternative storage.
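A process check of this kind might look like the following sketch, which polls a process ID and starts the next task only once that process is gone. It is an illustration under assumptions, not the exact mechanism I use: the function names are hypothetical, it is POSIX-only, and it assumes you already know the PID of the running backup (for example, from a PID file).

```python
import os
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def wait_then_run(pid: int, next_task: Callable[[], T], poll_seconds: float = 30.0) -> T:
    """Block until the process exits, then run the next task and return its result."""
    while is_running(pid):
        time.sleep(poll_seconds)
    return next_task()
```

One caveat: a finished child process that its parent has not yet reaped still counts as existing, so this check suits backups started by a separate job, such as cron, rather than by the checking script itself.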
Managing space for backups is a major pain point for system administrators. Space is always a problem, and data grows at a very high rate. As system administrators, we are always looking for more space. Users require more space. Logs continually grow. Databases constantly grow. Programs are becoming larger. And every bit of hardware sends, receives, or stores data. You need to retain data according to policy and perform backups, but you also have a responsibility to preserve valuable disk space.