Linux users increasingly rely on massive file systems to house their business-critical data sets. When I started to work in IT and entered data centers, colleagues would point at a rack containing a storage system and respectfully explain that it could hold one terabyte of data. Storage resources have grown by orders of magnitude over the last few years, now often spanning many terabytes or reaching petabyte scale.
What is bit rot?
Of course, we also expect to reliably read back what we wrote to storage. Hard disks and SSDs have an impressively low probability of returning different data to the upper layers than was written. With increasing storage capacity, however, the chance of getting wrong data back somewhere is rising. If a storage device, for example a spinning disk, cannot read a sector, it will report an I/O error to the upper layers.
When we get different data back from the disk than we wrote, without any error being reported, that is called bit rot. What happens in that case? Will Red Hat Enterprise Linux notice? In which ways can we deal with this situation?
The following steps use a RHEL 8 system. Even a small KVM guest is enough so you can try out these commands for yourself.
Will RHEL file systems detect bit rot?
By definition, with bit rot we are getting different data from the block device than we wrote. Thus, if an application like a database uses the block device directly, without a filesystem layer, it has to deal with bit rot by itself.
Let’s look at bit rot on a block device with XFS, the default filesystem on RHEL 7 and RHEL 8. Instead of using a real hard disk and waiting for bit rot, we will change a single bit to simulate it.
Let’s start by generating a 128 MB file consisting of zeros. We will then use losetup to make the file available as a block device, create an XFS file system on it and mount it.
# dd if=/dev/zero of=rawfile bs=1M count=128
# MYLOOP=$(losetup -f --show rawfile)
# echo $MYLOOP
/dev/loop27
# mkfs.xfs $MYLOOP
# mount $MYLOOP /mnt
We will now store a single file on the file system, filled with the ASCII character y and newline characters. After unmounting the file system, we will look at some bytes at an offset of 50 MB.
# yes >/mnt/infile
yes: standard output: No space left on device
# md5sum /mnt/infile
66e48b263b313703fce56a8a5a848eef /mnt/infile
# umount /mnt
# losetup -d $MYLOOP
# dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
00000000 79 0a 79 0a 79 0a 79 0a 79 0a |y.y.y.y.y.|
0000000a
At offset 50 MB, we have the character code for y, written in hexadecimal as 0x79 or in binary as 01111001. We will now flip the last bit to 0, which turns it into hex 0x78, the ASCII code for x. After changing the character, we verify what we wrote with hexdump.
# echo 'x' | dd of=rawfile bs=1 count=1 seek=$((50*1024*1024)) conv=notrunc
# dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
00000000 78 0a 79 0a 79 0a 79 0a 79 0a |x.y.y.y.y.|
0000000a
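The same single-bit flip can be tried on a scratch file without root privileges. The following is a minimal sketch, not part of the original steps: the temporary file, its 4096-byte size, the offset 1000 and the byte_at helper are arbitrary assumptions for illustration. It reads one byte, XORs it with 1 so that only the lowest bit changes, and writes it back in place.

```shell
set -eu

img=$(mktemp)     # scratch file standing in for rawfile
offset=1000       # arbitrary even offset, so it holds a 'y'

# Fill the scratch file with "y\n" pairs, like the output of yes.
yes 2>/dev/null | head -c 4096 > "$img"

# Helper (hypothetical name): print the byte at $offset as a decimal number.
byte_at() {
    dd if="$img" bs=1 skip="$offset" count=1 2>/dev/null | od -An -tu1 | tr -d ' '
}

old=$(byte_at)
new=$(( old ^ 1 ))    # XOR with 00000001 flips only the last bit

# Write the flipped byte back; conv=notrunc keeps the rest of the file.
printf "\\$(printf '%03o' "$new")" |
    dd of="$img" bs=1 seek="$offset" count=1 conv=notrunc 2>/dev/null

now=$(byte_at)
printf 'before=0x%x after=0x%x\n' "$old" "$now"
```

Run as a script, this should report the byte changing from 0x79 (y) to 0x78 (x), matching the hexdump output above.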
Now we can set the file up as a loop device again, mount the file system, and check whether the changed data gets noticed.
# MYLOOP=$(losetup -f --show rawfile)
# mount $MYLOOP /mnt/
# md5sum /mnt/infile
adf92be755d095e898281b1146be72ce /mnt/infile
The checksum changed; in other words, we are getting different data from our underlying block device. At this point, nothing hints at the data having changed: the supported file systems in RHEL, like Ext4 or XFS, do not keep checksums over data. As of RHEL 8, both protect their metadata, the structures they use for their own "housekeeping", with checksums, but this does not cover the actual data.
For more sophisticated error injection, dm-dust can be considered. dm-dust emulates the behaviour of bad sectors; it is currently not in RHEL but is available, for example, in Fedora.
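Since XFS and Ext4 do not checksum file contents, one way to notice such a change from user space is to record a checksum next to the data and verify it later. The sketch below is an illustration under assumptions, not something the file system does for you: the scratch directory, the file names and the choice of sha256sum are hypothetical, and the corruption is simulated with dd in the same way as above.

```shell
set -eu

dir=$(mktemp -d)                       # scratch directory for the demo
yes 2>/dev/null | head -c 1048576 > "$dir/infile"

# Record a checksum while the data is known to be good.
( cd "$dir" && sha256sum infile > infile.sha256 )

# Simulate bit rot: overwrite one 'y' in the middle of the file with 'x'.
printf 'x' | dd of="$dir/infile" bs=1 seek=524288 count=1 conv=notrunc 2>/dev/null

# Verify later; --status makes sha256sum report only through its exit code.
if ( cd "$dir" && sha256sum --check --status infile.sha256 ); then
    result=intact
else
    result=corrupted
fi
echo "infile is $result"
```

This only detects the change when somebody runs the check, and it says nothing about where in the file the damage is.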
How can I detect bit rot?
With RHEL 8 and later we can detect bit rot, thanks to the dm-integrity kernel code, which uses checksums to detect it. Let’s look at our bit rot situation with dm-integrity as an additional layer.
# dd if=/dev/zero of=rawfile bs=1M count=128
# MYLOOP=$(losetup -f --show rawfile)
# integritysetup format $MYLOOP
# integritysetup open $MYLOOP mydata
# mkfs.xfs /dev/mapper/mydata
# mount /dev/mapper/mydata /mnt
# yes >/mnt/infile
# md5sum /mnt/infile
13e14c50aaf2054d987663ed31b5f786 /mnt/infile
# umount /mnt/
# losetup -d $MYLOOP
# integritysetup close mydata
# dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
00000000 79 0a 79 0a 79 0a 79 0a 79 0a |y.y.y.y.y.|
0000000a
# echo 'x' | dd of=rawfile bs=1 count=1 seek=$((50*1024*1024)) conv=notrunc
# dd if=rawfile bs=1 count=10 skip=$((50*1024*1024)) 2>/dev/null|hexdump -vC
00000000 78 0a 79 0a 79 0a 79 0a 79 0a |x.y.y.y.y.|
0000000a
Now let’s see what happens when we read the file back through dm-integrity:
# MYLOOP=$(losetup -f --show rawfile)
# integritysetup open $MYLOOP mydata
# mount /dev/mapper/mydata /mnt
# md5sum /mnt/infile
md5sum: /mnt/infile: Input/output error
What happened here? With dm-integrity, the change in the underlying data was noticed and reported to the layer above as an I/O error. Without dm-integrity, the changed bit went unnoticed; now we get an I/O error. When using cp, we will notice that cp copies the first part of the file and stops when it receives the I/O error. With the following command, we can ask dd to continue reading the file despite I/O errors:
# dd if=/mnt/infile of=/tmp/infile.copy conv=noerror
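With conv=noerror alone, dd writes nothing for a block it cannot read, so everything after that block moves forward in the copy. GNU dd also accepts conv=noerror,sync, which pads each short or failed read with zero bytes so that the remaining data stays at its original offset. The sketch below only shows the invocation on a healthy scratch file; the file names and the 4096-byte block size are assumptions. On the corrupted /mnt/infile, the unreadable block would come out as zeros instead.

```shell
set -eu

src=$(mktemp)      # stands in for /mnt/infile
dst=$(mktemp)      # stands in for /tmp/infile.copy

# 64 KiB of "y\n" pairs, an exact multiple of the block size below.
yes 2>/dev/null | head -c 65536 > "$src"

# Read in 4096-byte blocks, continue after read errors, pad failed reads with zeros.
dd if="$src" of="$dst" bs=4096 conv=noerror,sync 2>/dev/null

# With no read errors the copy is byte-for-byte identical.
if cmp -s "$src" "$dst"; then same=yes; else same=no; fi
echo "identical copy: $same"
```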
The ddrescue utility, which is not part of RHEL, can be used to create a copy of the intact data, leaving a ‘hole’ in the destination file that is exactly as big as the data which could not be read. ddrescue works on sectors, so in our example the destination file has a hole of 4096 bytes.
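A checksum kept per block, rather than per file, is what allows a single flipped bit to be narrowed down to one block. The following user-space sketch imitates that idea on a scratch file. It is only an analogy under assumptions: the 4096-byte block size matches the hole size mentioned above, while the file names, the use of sha256sum and the sums helper are hypothetical and unrelated to how the kernel stores its checksums.

```shell
set -eu

work=$(mktemp -d)
bs=4096                                   # block size, assumed for the demo
yes 2>/dev/null | head -c $(( 16 * bs )) > "$work/data"

# Helper (hypothetical): print "<block number> <checksum>" for every block of a file.
sums() {
    blocks=$(( $(wc -c < "$1") / bs ))
    i=0
    while [ "$i" -lt "$blocks" ]; do
        sum=$(dd if="$1" bs="$bs" skip="$i" count=1 2>/dev/null | sha256sum | cut -d' ' -f1)
        printf '%s %s\n' "$i" "$sum"
        i=$(( i + 1 ))
    done
}

sums "$work/data" > "$work/tags"          # checksums of the good data

# Simulate bit rot in block 5.
printf 'x' | dd of="$work/data" bs=1 seek=$(( 5 * bs + 10 )) count=1 conv=notrunc 2>/dev/null

sums "$work/data" > "$work/tags.now"

# Lines present only in the old list name the blocks whose checksum changed.
bad=$(diff "$work/tags" "$work/tags.now" | sed -n 's/^< \([0-9][0-9]*\) .*/\1/p')
echo "checksum mismatch in block: $bad"
```

Only block 5 is reported; the other 15 blocks still verify and could be copied as they are.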
How is bit rot handled, in general?
The bit rot topic is known across the industry and approached in various software projects.
- dm-integrity is able to detect bit rot. As a device mapper layer, it can be used with various other layers on top, like file systems, LUKS, LVM or compression layers.
- File systems like Btrfs or ZFS are designed with bit rot in mind. Btrfs checksums data and can therefore detect bit rot, but on a single device without redundant copies it cannot correct the damage automatically.
- Ceph is a distributed file system. With the BlueStore backend, by default all data and metadata written to BlueStore are protected by one or more checksums. Data and metadata from the disk are verified before handover to the user. So, we have detection and reporting of corruption when data is read, but no automatic fixing.
Summary and the future
In this post, we have looked at what bit rot is and how to detect it with dm-integrity. File systems like Btrfs and ZFS also aim at dealing with bit rot, but are, for various reasons, not available in RHEL.
The checksumming done by dm-integrity leads to more I/O to the underlying block device and to less usable space, as the checksums are stored as well. As an alternative to setups with integritysetup, dm-integrity can be used together with LUKS, which provides encryption.
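For a rough feel of the space cost, the arithmetic below estimates how much room the checksums alone would take on the 128 MB test device. The numbers are assumptions for illustration, not measured values: a 4-byte checksum per sector (the size of a crc32c value) and sector sizes of 512 and 4096 bytes. The journal and other dm-integrity metadata are ignored, so the real overhead is higher. The tag_kib helper is a hypothetical name.

```shell
set -eu

dev_bytes=$(( 128 * 1024 * 1024 ))   # size of rawfile
tag_bytes=4                          # assumed: one 4-byte checksum per sector

# Helper (hypothetical): KiB of checksums needed for a given sector size.
tag_kib() {
    echo $(( dev_bytes / $1 * tag_bytes / 1024 ))
}

echo "512-byte sectors:  $(tag_kib 512) KiB of checksums"
echo "4096-byte sectors: $(tag_kib 4096) KiB of checksums"
```

Under these assumptions, smaller sectors cost eight times as much checksum space as 4096-byte sectors, in exchange for finer-grained detection.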
In this post, we have not looked at fixing bit rot-induced errors. For this, data should be stored multiple times, for example with dm-integrity directly on top of disks, and on top of that RAID1. If bit rot is detected, the healthy half of the mirror can be used.
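The idea of using the healthy half of a mirror can be imitated in user space. The sketch below is an analogy under assumptions, not how mdadm or LVM RAID work internally: two scratch files stand in for the two mirror legs, a recorded sha256sum stands in for the integrity checksum, and the file names and the healthy helper are hypothetical. The copy whose checksum no longer matches is rewritten from the one that still does.

```shell
set -eu

dir=$(mktemp -d)
yes 2>/dev/null | head -c 65536 > "$dir/legA"    # first mirror leg
cp "$dir/legA" "$dir/legB"                       # second mirror leg
good=$(sha256sum < "$dir/legA" | cut -d' ' -f1)  # checksum of the known good data

# Simulate bit rot on leg A only.
printf 'x' | dd of="$dir/legA" bs=1 seek=1000 count=1 conv=notrunc 2>/dev/null

# Helper (hypothetical): succeed if the file still matches the recorded checksum.
healthy() {
    [ "$(sha256sum < "$1" | cut -d' ' -f1)" = "$good" ]
}

repaired=none
if ! healthy "$dir/legA" && healthy "$dir/legB"; then
    cp "$dir/legB" "$dir/legA"                   # rewrite the bad leg from the good one
    repaired=legA
elif ! healthy "$dir/legB" && healthy "$dir/legA"; then
    cp "$dir/legA" "$dir/legB"
    repaired=legB
fi
echo "repaired: $repaired"
```

Without the checksum, a mirror alone could only tell that the two legs differ, not which one is correct.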
Regarding support status: dm-integrity is not labeled as TechPreview, it is supported. Heavy stacks of RAID1 on top, with mdadm or LVM-raid, are not yet widely tested or recommended for production.
Related to our topics here are DIF/DIX, Data Integrity Field/Data Integrity Extension. DIF is a standard to compute a checksum and store it on the disk, to be able to verify the integrity of the stored data. DIX uses checksums to protect data while it traverses various storage layers of systems, intended to be implemented by storage at the bottom and application on the top, to detect issues. This kbase solution has details.
Detecting bit rot solves half the problem, but it's more than half of the work. What's left is to build an automated way to use RAID to fix corrupted data by using a known good copy (that is, one with a valid checksum) to recreate the corrupted segments on a different part of the disk. This part is still being worked on, at Red Hat and in the upstream communities.
About the author
Christian Horn is a Senior Technical Account Manager at Red Hat. After working with customers and partners since 2011 at Red Hat Germany, he moved to Japan, focusing on mission critical environments. Virtualization, debugging, performance monitoring and tuning are among the returning topics of his daily work. He also enjoys diving into new technical topics, and sharing the findings via documentation, presentations or articles.