This article describes a solution to a problem you may have faced when running Red Hat OpenShift Data Foundation in a cloud environment: as a deployment matures, it demands more resources, more nodes, and more Object Storage Devices (OSDs).
Below you’ll find a step-by-step procedure for migrating the data from the existing OSDs to new, larger ones, so that you can manage more data with the same resources. The procedure avoids data loss, and OpenShift Data Foundation migrates all the data for you in two logical steps:
- Add a new StorageDeviceSet to the StorageCluster
- Remove the old OSDs one by one, then remove the old StorageDeviceSet
Important: Before implementing this procedure, especially in a production environment, it is strongly recommended that you open a support case so that the support team knows about the activity and can check the environment, which lets you proceed more safely.
Let’s go with the details!
Back up the StorageCluster CR
$ oc project openshift-storage
Already on project "openshift-storage" on server "https://api.ocpcluster.example.com:6443".
$ oc get storagecluster ocs-storagecluster -o yaml > ocs-storagecluster.yaml
Edit StorageCluster CR
Add a new storageDeviceSet object containing the new disk configuration; in this case, the size will be 2Ti. Here is the configuration that has to be added under the spec property:
storageDeviceSets:
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 2Ti
      storageClassName: managed-csi-azkey
      volumeMode: Block
    status: {}
  name: ocs-deviceset-large
  placement: {}
  preparePlacement: {}
  replica: 3
  resources:
    limits:
      cpu: "2"
      memory: 5Gi
    requests:
      cpu: "2"
      memory: 5Gi
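As a non-interactive alternative to editing the CR by hand, the same device set could probably be appended with a JSON patch. The following is only a sketch: the helper name deviceset_add_patch is hypothetical, and it omits the fields that the YAML above shows as empty ({}), on the assumption that they are defaults. Review the generated patch before applying it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a JSON patch (RFC 6902) that appends one
# storageDeviceSet entry to the StorageCluster spec.
#   $1 = device set name, $2 = disk size, $3 = storage class
deviceset_add_patch() {
  local name="$1" size="$2" storage_class="$3"
  cat <<EOF
[{"op": "add", "path": "/spec/storageDeviceSets/-", "value": {
  "name": "${name}",
  "count": 1,
  "replica": 3,
  "dataPVCTemplate": {"spec": {
    "accessModes": ["ReadWriteOnce"],
    "resources": {"requests": {"storage": "${size}"}},
    "storageClassName": "${storage_class}",
    "volumeMode": "Block"}},
  "resources": {
    "limits": {"cpu": "2", "memory": "5Gi"},
    "requests": {"cpu": "2", "memory": "5Gi"}}
}}]
EOF
}

# Usage (values taken from this article):
#   oc -n openshift-storage patch storagecluster ocs-storagecluster --type json \
#     -p "$(deviceset_add_patch ocs-deviceset-large 2Ti managed-csi-azkey)"
```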
Wait for the new PVCs to be created
With count: 1 and replica: 3, the new device set produces three 2Ti PVCs:
$ oc get pvc
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                   Bound    pvc-ebd5ef7a-e802-4973-82be-b26fe7af973c   50Gi       RWO            ocs-storagecluster-ceph-rbd   127d
ocs-deviceset-large-0-data-04b54w   Bound    pvc-ed847deb-d716-4d66-83b0-d4c614ad3f55   2Ti        RWO            managed-csi-azkey             74s
ocs-deviceset-large-1-data-0q6mt5   Bound    pvc-b4af134b-8f7b-4d50-a0c0-b2f68068b313   2Ti        RWO            managed-csi-azkey             74s
ocs-deviceset-large-2-data-0b2p6b   Bound    pvc-1c07d21b-d81e-4755-a05b-22947b3b67e1   2Ti        RWO            managed-csi-azkey             74s
ocs-deviceset-small-0-data-025w8j   Bound    pvc-ef3dfb24-ff39-441a-bf41-3d700efe94d4   500Gi      RWO            managed-csi-azkey             17h
ocs-deviceset-small-1-data-08r9h5   Bound    pvc-5121174f-13d1-42d0-a24a-562978d151b4   500Gi      RWO            managed-csi-azkey             17h
ocs-deviceset-small-2-data-0czk2s   Bound    pvc-6d0e8bb8-b999-4367-a65d-bfde4c1c043b   500Gi      RWO            managed-csi-azkey             17h
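Instead of re-running oc get pvc by hand, the wait could be scripted. This is a minimal sketch, assuming the PVC names start with the device set name as in the output above; the function names bound_pvc_count and wait_for_pvcs are hypothetical, and the polling interval is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Count the PVCs whose name starts with the prefix $1 and whose phase is Bound.
# Reads "NAME PHASE" pairs, one per line, on stdin.
bound_pvc_count() {
  local prefix="$1"
  awk -v p="$prefix" 'index($1, p) == 1 && $2 == "Bound" { n++ } END { print n + 0 }'
}

# Poll until at least $2 PVCs with the prefix $1 are Bound.
#   $3 = seconds between checks (assumed default: 10)
wait_for_pvcs() {
  local prefix="$1" expected="$2" interval="${3:-10}" bound
  while :; do
    bound="$(oc -n openshift-storage get pvc \
      -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' \
      | bound_pvc_count "$prefix")"
    if [ "$bound" -ge "$expected" ]; then
      break
    fi
    echo "${bound}/${expected} ${prefix} PVCs Bound; checking again in ${interval}s" >&2
    sleep "$interval"
  done
  echo "${bound}/${expected} ${prefix} PVCs are Bound"
}

# Usage for the device set added in this article (count 1 x replica 3 = 3 PVCs):
#   wait_for_pvcs ocs-deviceset-large 3
```

Counting Bound PVCs against an expected number, rather than only looking for non-Bound ones, avoids reporting success before any of the new PVCs exist.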
Check that new OSDs have been created
If you don’t have the rook-ceph-tools pod enabled, you can activate it by following the article: https://access.redhat.com/articles/4628891
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                                 STATUS  REWEIGHT  PRI-AFF
 -1         7.46489  root default
 -5         7.46489      region northeurope
-10         2.48830          zone northeurope-1
 -9         2.48830              host ocpcluster-kl7ds-ocs-northeurope1-zzhql
  2    hdd  2.00000                  osd.2                                         up   1.00000  1.00000
  4    hdd  0.48830                  osd.4                                         up   1.00000  1.00000
-14         2.48830          zone northeurope-2
-13         2.48830              host ocpcluster-kl7ds-ocs-northeurope2-4b6wx
  0    hdd  2.00000                  osd.0                                         up   1.00000  1.00000
  5    hdd  0.48830                  osd.5                                         up   1.00000  1.00000
 -4         2.48830          zone northeurope-3
 -3         2.48830              host ocpcluster-kl7ds-ocs-northeurope3-4gzb5
  1    hdd  2.00000                  osd.1                                         up   1.00000  1.00000
  3    hdd  0.48830                  osd.3                                         up   1.00000  1.00000
Wait for data rebalance to be completed
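The ceph status command must report HEALTH_OK, and all placement groups (PGs) must be in the active+clean state. HEALTH_OK alone is not enough: in the Before output that follows, health is already HEALTH_OK while 125 PGs are still remapped.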
Before:
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph status
  cluster:
    id:     e2c7dfa8-fa8b-4ba7-a3f6-b22e2d4d410f
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum b,c,d (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 2m), 6 in (since 3m); 125 remapped pgs
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 86.97k objects, 332 GiB
    usage:   999 GiB used, 6.5 TiB / 7.5 TiB avail
    pgs:     206865/260913 objects misplaced (79.285%)
             124 active+remapped+backfill_wait
             68  active+clean
             1   active+remapped+backfilling
  io:
    client:   1.9 KiB/s rd, 299 KiB/s wr, 2 op/s rd, 8 op/s wr
    recovery: 23 MiB/s, 5 objects/s
After:
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph status
  cluster:
    id:     e2c7dfa8-fa8b-4ba7-a3f6-b22e2d4d410f
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum b,c,d (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 6 osds: 6 up (since 69m), 6 in (since 70m)
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 87.17k objects, 333 GiB
    usage:   1021 GiB used, 6.5 TiB / 7.5 TiB avail
    pgs:     193 active+clean
  io:
    client:   2.2 KiB/s rd, 488 KiB/s wr, 2 op/s rd, 6 op/s wr
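The two conditions (HEALTH_OK and every PG active+clean) can be checked mechanically. The sketch below parses the plain-text ceph status output in the layout shown above; the function names ceph_is_settled and wait_for_rebalance are hypothetical, the 60-second interval is an arbitrary default, and the parsing assumes this output layout does not change between Ceph versions.

```shell
#!/usr/bin/env bash
# Succeed only if `ceph status` text on stdin reports HEALTH_OK and the number
# of active+clean PGs equals the total PG count from the "pools:" line.
ceph_is_settled() {
  awk '
    $1 == "health:" { health = $2 }
    $1 == "pools:"  { total = $4 }                                  # "pools: 4 pools, 193 pgs"
    $1 == "pgs:" && $3 == "active+clean" && NF == 3 { clean = $2 }  # "pgs: 193 active+clean"
    $2 == "active+clean" && NF == 2 { clean = $1 }                  # "68 active+clean"
    END { exit !(health == "HEALTH_OK" && total > 0 && clean == total) }
  '
}

# Poll ceph status through the rook-ceph-tools pod until the cluster is settled.
#   $1 = seconds between checks (assumed default: 60)
wait_for_rebalance() {
  local interval="${1:-60}" tools_pod
  tools_pod="$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)"
  until oc -n openshift-storage rsh "$tools_pod" ceph status | ceph_is_settled; do
    echo "rebalance still in progress; checking again in ${interval}s" >&2
    sleep "$interval"
  done
  echo "cluster is HEALTH_OK and all PGs are active+clean"
}

# Usage:
#   wait_for_rebalance        # check every 60 seconds
#   wait_for_rebalance 300    # check every 5 minutes
```

Fed the Before output above, ceph_is_settled fails because only 68 of 193 PGs are active+clean, even though health is HEALTH_OK; fed the After output, it succeeds.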
Remove old OSDs
This step is based on the solution https://access.redhat.com/solutions/5015451
Scale the ocs-operator and rook-ceph-operator deployments to zero
$ oc scale deploy ocs-operator --replicas 0
deployment.apps/ocs-operator scaled
$ oc scale deploy rook-ceph-operator --replicas 0
deployment.apps/rook-ceph-operator scaled
$ oc get deploy ocs-operator rook-ceph-operator
NAME READY UP-TO-DATE AVAILABLE AGE
ocs-operator 0/0 0 0 128d
rook-ceph-operator 0/0 0 0 128d
Get the osd.id of all the OSDs that are going to be removed
In this case, osd.3, osd.4, and osd.5, which are the OSDs with weight 0.48830 (the 500Gi disks):
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                                 STATUS  REWEIGHT  PRI-AFF
 -1         7.46489  root default
 -5         7.46489      region northeurope
-10         2.48830          zone northeurope-1
 -9         2.48830              host ocpcluster-kl7ds-ocs-northeurope1-zzhql
  2    hdd  2.00000                  osd.2                                         up   1.00000  1.00000
  4    hdd  0.48830                  osd.4                                         up   1.00000  1.00000
-14         2.48830          zone northeurope-2
-13         2.48830              host ocpcluster-kl7ds-ocs-northeurope2-4b6wx
  0    hdd  2.00000                  osd.0                                         up   1.00000  1.00000
  5    hdd  0.48830                  osd.5                                         up   1.00000  1.00000
 -4         2.48830          zone northeurope-3
 -3         2.48830              host ocpcluster-kl7ds-ocs-northeurope3-4gzb5
  1    hdd  2.00000                  osd.1                                         up   1.00000  1.00000
  3    hdd  0.48830                  osd.3                                         up   1.00000  1.00000
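On a cluster with many OSDs, the IDs to remove could be picked out of the ceph osd tree output by weight rather than by eye. This is a rough sketch: the function name osds_below_weight is hypothetical, and the threshold is an assumption you choose so that it separates the old disks from the new ones (here, anything below 1 matches the 0.48830 OSDs but not the 2.00000 ones). Always compare the result with the tree before removing anything.

```shell
#!/usr/bin/env bash
# Print the IDs of the OSDs whose weight (third column) is below $1.
# Reads `ceph osd tree` text output on stdin; only rows whose fourth
# column looks like "osd.N" are considered.
osds_below_weight() {
  local max_weight="$1"
  awk -v max="$max_weight" '$4 ~ /^osd\.[0-9]+$/ && $3 + 0 < max + 0 { print $1 }'
}

# Usage:
#   oc -n openshift-storage rsh \
#     "$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)" \
#     ceph osd tree | osds_below_weight 1
```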
Important: Execute the following steps serially, one OSD at a time, and wait for the data rebalance to finish after each OSD removal, in order to avoid potential data loss.
Scale the deployment of the OSD being removed to zero
In this case the first one will be osd.3:
$ oc scale deploy rook-ceph-osd-3 --replicas 0
deployment.apps/rook-ceph-osd-3 scaled
$ oc get deploy rook-ceph-osd-3
NAME READY UP-TO-DATE AVAILABLE AGE
rook-ceph-osd-3 0/0 0 0 19h
Remove the OSD
Set the failed_osd_id variable to the ID of the OSD to remove: 3 on this first pass, then 4 and 5 on the following passes:
$ failed_osd_id=3
$ oc process -n openshift-storage ocs-osd-removal -p FORCE_OSD_REMOVAL=true -p FAILED_OSD_IDS=${failed_osd_id} | oc create -f -
job.batch/ocs-osd-removal-job created
Wait for the job completion
Check the log of the newly created pod and look for the "completed removal" message:
$ oc get jobs
NAME COMPLETIONS DURATION AGE
ocs-osd-removal-job 1/1 13s 45s
rook-ceph-osd-prepare-2abe011277f790a287a5a129e960558c 1/1 32s 85m
rook-ceph-osd-prepare-a604505c4d1ba7640d40e4553f495658 1/1 29s 85m
rook-ceph-osd-prepare-dac4a35f2d709d73b7af34935b4fd19b 1/1 30s 85m
rook-ceph-osd-prepare-e0b2c88b9729e8cccd0f64c3bfa09dbb 1/1 31s 19h
rook-ceph-osd-prepare-e81129ea7423d35d417a8675f58f8d1c 1/1 30s 19h
$ oc get pod | grep ocs-osd-removal-job
ocs-osd-removal-job-mswng 0/1 Completed 0 56s
$ oc logs ocs-osd-removal-job-mswng | tail -2
2023-12-14 10:53:15.403183 I | cephosd: no ceph crash to silence
2023-12-14 10:53:15.403231 I | cephosd: completed removal of OSD 3
Remove the job
$ oc delete job ocs-osd-removal-job
job.batch "ocs-osd-removal-job" deleted
Wait for data rebalance to be completed
The ceph status command must report HEALTH_OK, and all placement groups (PGs) must be in the active+clean state.
Before:
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph status
  cluster:
    id:     e2c7dfa8-fa8b-4ba7-a3f6-b22e2d4d410f
    health: HEALTH_WARN
            Degraded data redundancy: 12672/261621 objects degraded (4.844%), 24 pgs degraded, 24 pgs undersized
  services:
    mon: 3 daemons, quorum b,c,d (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up (since 3m), 5 in (since 2m); 28 remapped pgs
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 87.21k objects, 333 GiB
    usage:   958 GiB used, 6.0 TiB / 7.0 TiB avail
    pgs:     12672/261621 objects degraded (4.844%)
             2081/261621 objects misplaced (0.795%)
             165 active+clean
             24  active+undersized+degraded+remapped+backfilling
             4   active+remapped+backfilling
  io:
    client:   852 B/s rd, 99 KiB/s wr, 1 op/s rd, 9 op/s wr
    recovery: 147 MiB/s, 37 objects/s
After:
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph status
  cluster:
    id:     e2c7dfa8-fa8b-4ba7-a3f6-b22e2d4d410f
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum b,c,d (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 5 osds: 5 up (since 19m), 5 in (since 17m)
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 87.56k objects, 334 GiB
    usage:   1004 GiB used, 6.0 TiB / 7.0 TiB avail
    pgs:     193 active+clean
  io:
    client:   852 B/s rd, 71 KiB/s wr, 1 op/s rd, 6 op/s wr
NOTE: Repeat the steps above, from scaling the OSD deployment to zero through waiting for the data rebalance to complete, for each OSD to remove.
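Because the same commands are repeated for each OSD, they could be wrapped in a function. The following is a sketch of the per-OSD steps shown above, not a replacement for the referenced solution: the function name remove_osd is hypothetical, the 10-minute timeout is an assumed value, and the function deliberately stops after deleting the job. Waiting for HEALTH_OK and active+clean before calling it for the next OSD remains a separate, mandatory step.

```shell
#!/usr/bin/env bash
# Remove a single OSD, following the per-OSD steps of this article.
#   $1 = numeric OSD ID (for example 3 for osd.3)
remove_osd() {
  local osd_id="$1"
  local ns="openshift-storage"
  local job="ocs-osd-removal-job"

  # Stop the OSD daemon by scaling its deployment to zero.
  oc -n "$ns" scale deploy "rook-ceph-osd-${osd_id}" --replicas 0 || return 1

  # Instantiate the removal template with the same parameters as the article.
  oc process -n "$ns" ocs-osd-removal \
    -p FORCE_OSD_REMOVAL=true -p FAILED_OSD_IDS="${osd_id}" \
    | oc create -n "$ns" -f - || return 1

  # Block until the job completes (the timeout is an assumed value).
  oc -n "$ns" wait --for=condition=complete "job/${job}" --timeout=10m || return 1

  # Delete the job only if its log confirms the removal of this exact OSD.
  if oc -n "$ns" logs "job/${job}" \
      | grep -qE "completed removal of OSD ${osd_id}([^0-9]|\$)"; then
    oc -n "$ns" delete job "$job"
  else
    echo "removal of OSD ${osd_id} not confirmed; inspect the job log" >&2
    return 1
  fi
}

# Usage: one OSD at a time, with a full rebalance in between.
#   remove_osd 3
#   ...wait until ceph status shows HEALTH_OK and all PGs active+clean...
#   remove_osd 4
```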
Remove the old storageDeviceSet from the storageCluster CR
Edit the storageCluster CR with the command oc edit storagecluster ocs-storagecluster and remove the section for the old storageDeviceSet. In this case, that is the one with the 500Gi disk size:
- config: {}
  count: 1
  dataPVCTemplate:
    metadata: {}
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 500Gi
      storageClassName: managed-csi-azkey
      volumeMode: Block
    status: {}
  name: ocs-deviceset-small
  placement: {}
  preparePlacement: {}
  replica: 3
  resources:
    limits:
      cpu: "2"
      memory: 5Gi
    requests:
      cpu: "2"
      memory: 5Gi
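If you prefer not to delete the block in an interactive editor, the entry could probably be removed with a JSON patch instead. This is a sketch under assumptions: the function names deviceset_index and remove_deviceset are hypothetical, and the patch first uses a JSON Patch test operation to confirm that the entry at the computed index really has the expected name, so that a changed ordering makes the patch fail rather than remove the wrong device set.

```shell
#!/usr/bin/env bash
# Print the zero-based position of the device set named $1.
# Reads device set names, one per line, on stdin; fails if the name is absent.
deviceset_index() {
  local name="$1"
  awk -v n="$name" '$0 == n { print NR - 1; found = 1; exit } END { exit !found }'
}

# Remove the device set named $1 from the ocs-storagecluster StorageCluster.
remove_deviceset() {
  local name="$1" idx patch
  idx="$(oc -n openshift-storage get storagecluster ocs-storagecluster \
    -o jsonpath='{range .spec.storageDeviceSets[*]}{.name}{"\n"}{end}' \
    | deviceset_index "$name")" || {
      echo "device set ${name} not found" >&2
      return 1
    }
  # "test" guards the "remove": both operations succeed or neither is applied.
  patch="$(printf '[{"op":"test","path":"/spec/storageDeviceSets/%s/name","value":"%s"},{"op":"remove","path":"/spec/storageDeviceSets/%s"}]' \
    "$idx" "$name" "$idx")"
  oc -n openshift-storage patch storagecluster ocs-storagecluster --type json -p "$patch"
}

# Usage:
#   remove_deviceset ocs-deviceset-small
```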
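Scale the ocs-operator deployment back to one replica
Once ocs-operator is running again, rook-ceph-operator also returns to 1/1, as the oc get deploy output shows: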
$ oc scale deploy ocs-operator --replicas 1
deployment.apps/ocs-operator scaled
$ oc get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
csi-addons-controller-manager 1/1 1 1 128d
csi-cephfsplugin-provisioner 2/2 2 2 128d
csi-rbdplugin-provisioner 2/2 2 2 128d
noobaa-endpoint 1/1 1 1 128d
noobaa-operator 1/1 1 1 128d
ocs-metrics-exporter 1/1 1 1 128d
ocs-operator 1/1 1 1 128d
odf-console 1/1 1 1 128d
odf-operator-controller-manager 1/1 1 1 128d
rook-ceph-crashcollector-2ab761a21d224ffa17656fcbf9ca40b7 1/1 1 1 19d
rook-ceph-crashcollector-58b62ca45efa9920a18db0e7f340975a 1/1 1 1 19d
rook-ceph-crashcollector-812474f5d99299c4d9485f0394522c7c 1/1 1 1 19d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a 1/1 1 1 128d
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b 1/1 1 1 128d
rook-ceph-mgr-a 1/1 1 1 128d
rook-ceph-mon-b 1/1 1 1 128d
rook-ceph-mon-c 1/1 1 1 128d
rook-ceph-mon-d 1/1 1 1 78d
rook-ceph-operator 1/1 1 1 128d
rook-ceph-osd-0 1/1 1 1 153m
rook-ceph-osd-1 1/1 1 1 153m
rook-ceph-osd-2 1/1 1 1 153m
rook-ceph-tools 1/1 1 1 114d
Final check
Check that the old OSDs and PVCs have been removed:
$ oc get pvc
NAME                                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
db-noobaa-db-pg-0                   Bound    pvc-ebd5ef7a-e802-4973-82be-b26fe7af973c   50Gi       RWO            ocs-storagecluster-ceph-rbd   128d
ocs-deviceset-large-0-data-04b54w   Bound    pvc-ed847deb-d716-4d66-83b0-d4c614ad3f55   2Ti        RWO            managed-csi-azkey             154m
ocs-deviceset-large-1-data-0q6mt5   Bound    pvc-b4af134b-8f7b-4d50-a0c0-b2f68068b313   2Ti        RWO            managed-csi-azkey             154m
ocs-deviceset-large-2-data-0b2p6b   Bound    pvc-1c07d21b-d81e-4755-a05b-22947b3b67e1   2Ti        RWO            managed-csi-azkey             154m
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                                 STATUS  REWEIGHT  PRI-AFF
 -1         6.00000  root default
 -5         6.00000      region northeurope
-10         2.00000          zone northeurope-1
 -9         2.00000              host ocpcluster-kl7ds-ocs-northeurope1-zzhql
  2    hdd  2.00000                  osd.2                                         up   1.00000  1.00000
-14         2.00000          zone northeurope-2
-13         2.00000              host ocpcluster-kl7ds-ocs-northeurope2-4b6wx
  0    hdd  2.00000                  osd.0                                         up   1.00000  1.00000
 -4         2.00000          zone northeurope-3
 -3         2.00000              host ocpcluster-kl7ds-ocs-northeurope3-4gzb5
  1    hdd  2.00000                  osd.1                                         up   1.00000  1.00000
$ oc -n openshift-storage rsh $(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name)
sh-4.4$ ceph status
  cluster:
    id:     e2c7dfa8-fa8b-4ba7-a3f6-b22e2d4d410f
    health: HEALTH_OK
  services:
    mon: 3 daemons, quorum b,c,d (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 28m), 3 in (since 27m)
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 193 pgs
    objects: 87.72k objects, 335 GiB
    usage:   1004 GiB used, 5.0 TiB / 6 TiB avail
    pgs:     193 active+clean
  io:
    client:   852 B/s rd, 71 KiB/s wr, 1 op/s rd, 5 op/s wr