Remove or replace a failed drive
When a drive fails, Portworx continues to operate using available replicas on other nodes. To fully recover from a drive failure, you must remove or replace the failed drive, and the recovery procedure depends on whether the pool containing the failed drive hosts Portworx metadata.
There are two possible scenarios for a failed drive:

- The drive belongs to a pool which hosts Portworx metadata
- The drive belongs to a non-metadata pool

Keep in mind the following restrictions:

- You cannot recover data from a completely failed drive in a RAID 0 pool.
- You can only recover volumes with a replication factor greater than 1. Refer to the updating volumes page for information on increasing the replication factor of a volume.
Perform the procedures below to determine whether your failed drive belongs to a storage pool containing metadata, then remove or replace the failed drive:
Determine if the pool containing your failed drive hosts Portworx metadata
You must determine if your failed drive belongs to a storage pool containing metadata to choose the appropriate method for replacing it.
1. Identify the failed drive and `ssh` into the node containing that drive.
2. Determine if the pool containing that drive hosts metadata by entering the `pxctl service pool show` command and looking for the `Has metadata` field in the output:

    ```
    pxctl service pool show
    ```

    ```
    PX drive configuration:
    Pool ID: 1
            UUID: 86d5d105-2eff-4dfb-b842-eca86906c921
            IO Priority: LOW
            Labels: iopriority=LOW,medium=STORAGE_MEDIUM_MAGNETIC
            Size: 20 GiB
            Status: Online
            Has metadata: Yes
            Drives:
            1: /dev/sdc, 8.0 GiB allocated of 20 GiB, Online
            Cache Drives:
    ```
In the output above, note the line that reads “Has metadata: Yes”.
If your service pool has metadata, follow the procedure to Replace a drive that belongs to a pool which hosts Portworx metadata. If your service pool does not have metadata, follow the procedure to Replace a drive that belongs to a non-metadata pool.
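This check can also be scripted. The sketch below decides which procedure applies by inspecting the `Has metadata` field; the output is captured into a variable from the sample above for illustration, whereas on a real node you would pipe the live `pxctl service pool show` output into `grep` instead.

```shell
# Sketch: choose the replacement procedure by checking the "Has metadata"
# field. The text below is the sample output from this page; on a node,
# pipe `pxctl service pool show` directly instead of using a variable.
pool_show='Pool ID: 1
Status: Online
Has metadata: Yes
Drives: 1: /dev/sdc, 8.0 GiB allocated of 20 GiB, Online'

if printf '%s\n' "$pool_show" | grep -q 'Has metadata:[[:space:]]*Yes'; then
  procedure="node-decommission"   # pool hosts Portworx metadata
else
  procedure="pool-delete"         # non-metadata pool
fi
echo "use the $procedure procedure"
```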
Remove or replace a drive that belongs to a pool which hosts Portworx metadata
If the drive belongs to a pool that hosts Portworx metadata, then you must remove the node from the cluster and remove or replace the failed drive.
Perform the following steps to remove or replace the failed drive:
1. Use the Node decommission workflow to remove the node from the cluster. This reduces the HA level for all replicated volumes which reside on the node and restores the volumes if enough storage nodes are available in the cluster.
2. Remove or replace the failed drive and add the node back into the cluster. Note that Portworx runs as a `DaemonSet` in Kubernetes, so when you add a node or a worker to your Kubernetes cluster, you don't need to explicitly run Portworx on it.
Remove or replace a drive that belongs to a non-metadata pool
If the drive belongs to a non-metadata pool, then you must delete the affected storage pool.
Perform the following steps to remove or replace a failed drive by deleting the storage pool:
1. Enter the `pxctl service pool show` command and make a note of both the UUID and the pool ID for the next steps:

    ```
    pxctl service pool show
    ```

    ```
    PX drive configuration:
    Pool ID: 0
            UUID: 63e528b8-bc2e-484e-a01e-91e5108ebba5
            IO Priority: LOW
            Labels: iopriority=LOW,medium=STORAGE_MEDIUM_MAGNETIC
            Size: 128 GiB
            Status: Online
            Has metadata: Yes
            Drives:
            1: /dev/sdc, 38 GiB allocated of 128 GiB, Online
            Cache Drives:
            No Cache drives found in this pool
    ```
2. Enter the `pxctl volume list` command with the `--pool-uid` option and the pool UUID you noted in the step above to list the volumes. Look through the output to find the replica sets that match the pool containing the failed drive:

    ```
    pxctl volume list --pool-uid 63e528b8-bc2e-484e-a01e-91e5108ebba5
    ```

    ```
    ID                  NAME           SIZE     HA  SHARED  ENCRYPTED  IO_PRIORITY  STATUS         SNAP-ENABLED
    184362890321734385  exampleVolume  128 GiB  3   no      no         LOW          up - detached  no
    ```
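The volume IDs in that listing can be pulled out with a little shell so they can be fed into later `pxctl` commands. This is a sketch over the sample output shown above; the column layout is assumed to match your `pxctl` version, and on a real node you would pipe the live command output instead of a variable.

```shell
# Sketch: extract volume IDs from captured `pxctl volume list --pool-uid ...`
# output. The text below mirrors the sample listing on this page.
volume_list='ID                 NAME          SIZE    HA SHARED ENCRYPTED IO_PRIORITY STATUS        SNAP-ENABLED
184362890321734385 exampleVolume 128 GiB 3  no     no        LOW         up - detached no'

# Skip the header row and print the first column (the volume ID).
volume_ids=$(printf '%s\n' "$volume_list" | awk 'NR > 1 { print $1 }')
echo "$volume_ids"
```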
3. Manually remove all replicas of the volumes in this pool. Refer to the Decreasing the replication factor section for information on how to do this.
4. Run the following command to enter pool maintenance mode:

    ```
    pxctl service pool maintenance --enter
    ```
5. To delete the pool, use the `pxctl service pool delete` command and specify the pool ID that you noted in step 1 as the argument:

    ```
    pxctl service pool delete 0
    ```
6. Run the following command to exit pool maintenance mode:

    ```
    pxctl service pool maintenance --exit
    ```
7. Optionally, replace the failed drive, ensuring that it's the same capacity and type as the failed one. Note that you can also re-form the pool without the failed drive, resulting in a functional pool with one less drive.
8. Re-add the drives by entering the `pxctl service drive add` command, specifying the `--drive` option with the paths to your drives. The following command adds three drives:

    ```
    pxctl service drive add --drive /dev/sdc /dev/sdd /dev/sde
    ```
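Before running the maintenance and delete steps above on a production node, it can help to review the exact command sequence first. The sketch below is a dry run: each command is echoed rather than executed, using the pool ID 0 from the sample output earlier (substitute your own pool ID).

```shell
# Dry-run sketch of the pool-delete sequence from the steps above. The
# commands are echoed, not executed, so the order can be reviewed safely.
# Pool ID 0 matches the sample output earlier; substitute your own.
pool_id=0
run() { echo "+ $*"; }   # swap the body for "$@" to actually execute

sequence=$(
  run pxctl service pool maintenance --enter
  run pxctl service pool delete "$pool_id"
  run pxctl service pool maintenance --exit
)
printf '%s\n' "$sequence"
```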