SSDs are now commonplace and have been the default choice for performance-oriented storage in enterprise and consumer environments for the past few years. SSDs are fast, but many people with high-end machines face this dilemma: My SSD sits behind a RAID controller that doesn't expose the device's DISCARD or TRIM capabilities, so how do I discard blocks to keep the best SSD performance? Here's a trick to do just that without having to disassemble your machine. Recent improvements in SSD firmware have made the use of DISCARD/TRIM less critical than it used to be.

There are, however, some cases in which you may need to have the filesystem inform the drive of the blocks it has discarded. Perhaps you have TLC (3 bits per cell) or QLC (4 bits per cell) drives instead of the usually more expensive enterprise-class SLC or MLC drives (the latter are less susceptible to a performance drop because they set aside more spare blocks to help with overwrites when the drive is at capacity). Or maybe you once filled your SSD to 100%, and now you cannot get the original performance/IOPS back.

On most systems, getting the performance back is usually a simple matter of issuing a filesystem trim (fstrim) command. Here's an example using a Red Hat Enterprise Linux (RHEL) system:

[root@System_A ~]# fstrim -av
/export/home: 130.5 GiB (140062863360 bytes) trimmed
/var: 26.1 GiB (28062511104 bytes) trimmed
/opt: 17.6 GiB (18832797696 bytes) trimmed
/export/shared: 31.6 GiB (33946275840 bytes) trimmed
/usr/local: 5.6 GiB (5959331840 bytes) trimmed
/boot: 678.6 MiB (711565312 bytes) trimmed
/usr: 36.2 GiB (38831017984 bytes) trimmed
/: 3 GiB (3197743104 bytes) trimmed
[root@System_A ~]#
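
On recent RHEL releases, you don't have to run this by hand: util-linux ships an fstrim.timer systemd unit that performs a weekly trim of all mounted filesystems that support it. A minimal sketch, assuming your release includes the unit:

[root@System_A ~]# systemctl enable --now fstrim.timer
[root@System_A ~]# systemctl list-timers fstrim.timer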

[ Readers also liked: Linux hardware: Converting to solid-state disks (SSDs) on the desktop ]

There's one catch, though...

If your SSDs are behind a RAID volume attached to a RAID controller (HPE's SmartArray, Dell's PERC, or anything based on LSI/Avago's MegaRAID), here's what happens:

[root@System_B ~]# fstrim -av
[root@System_B ~]# 

Just nothing. Nothing happens. At the end of the SCSI I/O chain, the capabilities a device exposes depend on both the device itself and the RAID controller and driver it sits behind.

Let's take a closer look. Here's an SSD (a Samsung 860 EVO 2TB drive) attached to a SATA connector on a RHEL system (we'll call that system System_A in the rest of this document):

[root@System_A ~]# lsscsi 
[3:0:0:0]    disk    ATA      Samsung SSD 860  3B6Q  /dev/sda 

Here's an identical drive (same model, same firmware) behind a RAID controller (a PERC H730P) on a different system (let's call that system System_B in the rest of this document):

[root@System_B ~]# lsscsi 
[0:2:0:0]    disk    DELL     PERC H730P Adp   4.30  /dev/sda 

How do I know it's the same drive? Because the RAID HBA can be queried with megaclisas-status, which shows this:

[root@System_B ~]# megaclisas-status
-- Controller information --
-- ID | H/W Model          | RAM    | Temp | BBU    | Firmware     
c0    | PERC H730P Adapter | 2048MB | 60C  | Good   | FW: 25.5.7.0005 

-- Array information --
-- ID | Type   |    Size |  Strpsz |   Flags | DskCache |   Status |  OS Path | CacheCade |InProgress   
c0u0  | RAID-0 |   1818G |  512 KB | ADRA,WB |  Enabled |  Optimal | /dev/sda | None      |None         

-- Disk information --
-- ID   | Type | Drive Model                                      | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID  
c0u0p0  | SSD  | S3YUNB0KC09340D Samsung SSD 860 EVO 2TB RVT03B6Q | 1.818 TB | Online, Spun Up | 6.0Gb/s  | 23C  | [32:0]   | 0   

Yes, it's the same drive (a Samsung 860 EVO) with the same firmware (3B6Q).
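
If megaclisas-status is not installed, smartmontools can also query the physical drive through the controller's passthrough interface. A sketch, assuming the drive's device ID on the controller is 0 (the LSI ID column above):

[root@System_B ~]# smartctl -d megaraid,0 -i /dev/sda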

Using lsblk, let's inspect the DISCARD capabilities of those two devices:

[root@System_A ~]# lsblk -dD
NAME     DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda             0      512B       2G         1
[root@System_B ~]# lsblk -dD
NAME      DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda              0        0B       0B         0

Here's the culprit: on System_B, all of the values are zero. The SSD in a RAID 0 behind a PERC H730P does not expose any DISCARD capabilities, which is why fstrim on System_B did not do or return anything.
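
lsblk reads these values from sysfs, so you can confirm the situation directly; a value of 0 means the device, as presented by the controller, accepts no discards:

[root@System_B ~]# cat /sys/block/sda/queue/discard_max_bytes
0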

HPE SmartArray systems are affected in a similar way. Here's an HPE DL360 Gen10 with a high-end SmartArray RAID card:

[root@dl360gen10 ~]# lsblk -dD
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda         0        0B       0B         0
sdc         0        0B       0B         0
sdd         0        0B       0B         0
sde         0        0B       0B         0
sdf         0        0B       0B         0
sdg         0        0B       0B         0
sdh         0        0B       0B         0

All LSI-based (megaraid_sas driver) and SmartArray-based (hpsa driver) systems suffer from this problem. If you wanted to TRIM your SSDs, you would have to shut down System_B, pull the drive out, connect it to a system with a plain SAS/SATA controller, and run fstrim there.

Fortunately for us, there's a small trick to temporarily expose the native capabilities of your device and TRIM it. It requires taking down the application that uses your RAID drive, but at least it does not require a walk to the data center to pull hardware out of a system.

The trick is to stop using the RAID drive through the RAID driver, expose the SSD as a JBOD, re-mount the filesystem, and TRIM it there. Once the blocks have been discarded, simply put the drive back in RAID mode, mount the filesystem, and restart your applications.

There are a couple of caveats:

  • The RAID hardware you are using must allow devices to be put in JBOD mode (a quick way to check this is shown after this list).
  • You cannot do this on your boot disk as it would require taking down the OS.
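
On MegaRAID hardware, you can check the first point by querying the controller's JBOD property before you start (the exact property syntax can vary between MegaCli versions):

[root@System_C ~]# MegaCli -AdpGetProp -EnableJBOD -a0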

Walking through the process

Here is a small walk-through created on a system with a Dell PERC H730P and a Samsung SSD. We'll call this system System_C.

1) The SSD is at [32:2] on HBA a0, and we'll create a single RAID 0 drive from it:

[root@System_C ~]# MegaCli -CfgLdAdd -r0 [32:2] WB RA CACHED -strpsz 512 -a0
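
If you are unsure of the [enclosure:slot] address of your drive, it can be read from the controller's physical drive list (assuming adapter 0):

[root@System_C ~]# MegaCli -PDList -a0 | grep -E 'Enclosure Device ID|Slot Number|Inquiry Data'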

2) The new logical drive pops up as /dev/sdd and shows no DISCARD capabilities:

[root@System_C ~]# lsblk -dD
NAME DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
[....]
sdd         0        0B       0B         0

3) Next, create a volume group (VG), a logical volume, and a 128G ext4 filesystem on top of that device:

[root@System_C ~]# parted /dev/sdd
[root@System_C ~]# pvcreate /dev/sdd1
[root@System_C ~]# vgcreate testdg /dev/sdd1
[root@System_C ~]# lvcreate -L 128G -n lv_test testdg
[root@System_C ~]# mke2fs -t ext4 /dev/testdg/lv_test
[root@System_C ~]# mount /dev/testdg/lv_test /mnt
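
For reference, the parted step above was done interactively; it can also be scripted. A sketch assuming a GPT label and a single partition spanning the whole disk:

[root@System_C ~]# parted -s /dev/sdd mklabel gpt mkpart primary 1MiB 100%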

For the sake of this demonstration, we'll copy some data to /mnt.
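
Since the logical drive will be deleted and re-created along the way, it's also a good idea to record checksums of that data so it can be verified at the end (/root/mnt.md5 is just an example path):

[root@System_C ~]# find /mnt -type f -exec md5sum {} + > /root/mnt.md5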

4) Stop the applications using the filesystem, unmount it, and export the volume group:

[root@System_C ~]# umount /mnt
[root@System_C ~]# vgchange -a n testdg
  0 logical volume(s) in volume group "testdg" now active
[root@System_C ~]# vgexport testdg
  Volume group "testdg" successfully exported

5) Enable JBOD mode on the HBA:

[root@System_C ~]# MegaCli -AdpSetProp -EnableJBOD -1 -a0

Adapter 0: Set JBOD to Enable success.

Exit Code: 0x00

6) Delete the logical drive and make the drive a JBOD. On most RAID controllers, safety checks prevent you from turning a drive that is still part of a logical drive into a JBOD:

[root@System_C ~]# MegaCli -PDMakeJBOD -PhysDrv[32:2] -a0

Adapter: 0: Failed to change PD state at EnclId-32 SlotId-2.

Exit Code: 0x01

The solution here is to delete the logical drive. This is purely a logical operation, and it will not touch our data. However, you must have written down the command used to create the RAID 0 array in the first place.
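
If you did not write it down, you can still read back the logical drive's parameters (RAID level, strip size, cache policy) before deleting it; here, the logical drive is L3 on adapter 0:

[root@System_C ~]# MegaCli -LDInfo -L3 -a0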

[root@System_C ~]# MegaCli -CfgLdDel -L3 -a0
                                     
Adapter 0: Deleted Virtual Drive-3(target id-3)

Exit Code: 0x00
[root@System_C ~]# MegaCli -PDMakeJBOD -PhysDrv[32:2] -a0
                                     
Adapter: 0: EnclId-32 SlotId-2 state changed to JBOD.

Exit Code: 0x00

7) Refresh the kernel's view of the disks and import your data:

[root@System_C ~]# partprobe
[root@System_C ~]# vgscan 
  Reading volume groups from cache.
  Found exported volume group "testdg" using metadata type lvm2
  Found volume group "rootdg" using metadata type lvm2

[root@System_C ~]# vgimport testdg
  Volume group "testdg" successfully imported

[root@System_C ~]# vgchange -a y testdg
  1 logical volume(s) in volume group "testdg" now active

[root@System_C ~]# mount /dev/testdg/lv_test /mnt
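
At this point, the kernel talks to the SSD directly, so the device should expose its DISCARD capabilities again. A quick check (the JBOD may appear under a different device name; adjust accordingly):

[root@System_C ~]# lsblk -dD /dev/sdd

The DISC-GRAN and DISC-MAX columns should now show non-zero values, much like on System_A.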

[root@System_C ~]# fstrim -v /mnt
/mnt: 125.5 GiB (134734139392 bytes) trimmed

We have discarded the empty blocks on our filesystem. Let's put it back in a RAID 0 logical drive.

8) Unmount the filesystem and export the volume group:

[root@System_C ~]# umount /mnt
[root@System_C ~]# vgchange -a n testdg
  0 logical volume(s) in volume group "testdg" now active
[root@System_C ~]# vgexport testdg
  Volume group "testdg" successfully exported

9) Disable JBOD mode on the RAID controller:

[root@System_C ~]# MegaCli -AdpSetProp -EnableJBOD -0 -a0

Adapter 0: Set JBOD to Disable success.

Exit Code: 0x00

10) Re-create your logical drive with the same parameters used in step 1:

[root@System_C ~]# MegaCli -CfgLdAdd -r0 [32:2] WB RA CACHED -strpsz 512 -a0

11) Ask the kernel to probe the disks and re-mount your filesystem:

[root@System_C ~]# partprobe
[root@System_C ~]# vgscan 
  Reading volume groups from cache.
  Found exported volume group "testdg" using metadata type lvm2
  Found volume group "rootdg" using metadata type lvm2

[root@System_C ~]# vgimport testdg
  Volume group "testdg" successfully imported

[root@System_C ~]# vgchange -a y testdg
  1 logical volume(s) in volume group "testdg" now active

[root@System_C ~]# mount /dev/testdg/lv_test /mnt

Your data should be there, and the performance of your SSD should be back to its original figures.
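
If you recorded checksums in step 3, this is the moment to verify the whole round trip (no output means every file checked out):

[root@System_C ~]# md5sum -c --quiet /root/mnt.md5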

[ Free online course: Red Hat Enterprise Linux technical overview. ] 

Wrapping up

Here are a few additional notes:

  • This procedure should be taken with a grain of salt and comes with a big warning: DO NOT perform it unless you are confident you can identify logical drives and JBODs on a Linux system.
  • I have only tested this procedure with RAID 0 logical drives. It seems unlikely to work for other RAID levels (5, 6, 1+0, etc.) because the data is striped across multiple drives, so the filesystem structure would not be visible on any single JBOD drive.
  • Please do not perform this procedure without verified backups.

About the author

Vincent Cojot is an unrepentant geek. When he's not fixing someone else's OpenStack or RHEL environment, he can be seen trying to squeeze the most efficient use of resources out of server hardware. After a career in the banking and telco industries, he thinks there's nothing as much fun as working for an open-source software company.
