PDS maintenance impact and recommended actions
Maintenance for PDS version 123.x is scheduled on November 29, 2024, from 2:30 AM to 2:30 PM UTC (November 28, 2024, from 6:30 PM to November 29, 2024, 6:30 AM PST).
During the scheduled maintenance window, the PDS production environment will be inaccessible. While data services will continue to operate normally, PDS operators and agents within your target clusters will temporarily lose connectivity with the control plane.
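To check the window above against another time zone, the conversion can be sketched with `date`. This is a minimal sketch that assumes GNU coreutils `date` (the `-d` option behaves differently on BSD/macOS). `to_local` is a hypothetical helper name, and `PST8` is a POSIX TZ string meaning UTC-8.

```shell
# Convert a UTC timestamp to another time zone (assumes GNU date).
# $1: time zone (POSIX TZ string such as PST8, or a tz database name)
# $2: timestamp in UTC, formatted as "YYYY-MM-DD HH:MM"
to_local() {
  TZ="$1" date -d "$2 UTC" '+%Y-%m-%d %H:%M'
}

# Start and end of the maintenance window in PST (UTC-8):
to_local PST8 '2024-11-29 02:30'
to_local PST8 '2024-11-29 14:30'
```

Replace `PST8` with your own zone to see when the control plane will be unavailable locally.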
Impact
- During the maintenance period, the PDS agent pod may enter a CrashLoopBackOff state. Example:
NAME                                                      READY   STATUS             RESTARTS      AGE
pds-agent-fdd8c5cd-xsscd                                  0/1     CrashLoopBackOff   2 (3s ago)    7h28m
pds-backup-controller-manager-5b45cb45d4-mmxcd            2/2     Running            0             7h28m
pds-deployment-controller-manager-8687ff7b4f-fnvxk        2/2     Running            0             7h28m
pds-external-dns-568bd6cfc6-ggkzv                         1/1     Running            0             7h28m
pds-mutator-65f86c5d9-m9htx                               1/1     Running            0             7h28m
pds-operator-target-controller-manager-5b658c9dc7-rt6zp   2/2     Running            0             7h28m
pds-tc-kube-state-metrics-cb4f8dd45-n9s4x                 1/1     Running            0             7h28m
pds-tc-prometheus-server-5554598694-nxt22                 2/2     Running            0             7h28m
pds-teleport-67d887bfdf-68mfz                             1/1     Running            0             7h28m
- Once maintenance is complete, all PDS pods and resources should return to a healthy state and successfully reconnect to the cloud service. If issues persist, refer to the post-maintenance recovery steps for guidance.
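A listing like the one in this section can be narrowed down to the pods that are not healthy. This is a minimal sketch that assumes the pods run in the `pds-system` namespace used elsewhere on this page. `unhealthy_pods` is a hypothetical helper name, not part of PDS.

```shell
# Print the name and status of every pod whose STATUS column is neither
# "Running" nor "Completed". Expects `kubectl get pods --no-headers`
# output on stdin (columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (requires kubectl access to the target cluster):
#   kubectl get pods -n pds-system --no-headers | unhealthy_pods
```

During the maintenance window the agent pod may appear in this output. After the window, the output should be empty.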
Recommended actions
The temporary pod failures will not affect the uptime or functionality of your data services. However, the following recommendations will help manage any observability alerts and monitoring disruptions caused by these failure events.
Option 1: Suppress alerts and ignore pod failures
Since pod failure events may trigger false alarms in your monitoring systems, Portworx recommends pausing or configuring your monitoring and observability solutions to ignore PDS pod failures during the maintenance period. Once the maintenance concludes and the control plane is accessible, the pods should recover automatically.
Option 2: Use the maintenance job to prevent alerts
If you cannot suppress alerts in your environment, Portworx provides a Kubernetes Job that temporarily scales down the PDS agents and operators. The job monitors the health of the control plane and automatically brings the operators and agents back online once maintenance is complete.
- Download the job file from https://pds.pure-px.io/maintenance-job.yaml, or copy the YAML below:
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: pds-system
  name: pds-maintenance-role
rules:
  - apiGroups: ["apps"] # Deployments belong to the "apps" API group
    resources: ["deployments"]
    verbs: ["*"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: pds-system
  name: pds-maintenance-rolebinding
subjects:
  - kind: ServiceAccount
    name: pds-maintenance-sa
    namespace: pds-system
roleRef:
  kind: Role
  name: pds-maintenance-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: pds-system
  name: pds-maintenance-sa
---
apiVersion: batch/v1
kind: Job
metadata:
  namespace: pds-system
  name: pds-maintenance-job
spec:
  template:
    spec:
      serviceAccountName: pds-maintenance-sa
      containers:
        - name: pds-maintenance-job
          image: portworx/pds-agent:maintenance-job
          imagePullPolicy: Always
      restartPolicy: Never
- Save the YAML above as pds-maintenance-job.yaml and apply it to each cluster that you have onboarded to PDS.
- (Optional) If you use a custom registry, pull the portworx/pds-agent:maintenance-job image to avoid interruptions.
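The apply step can be repeated for several clusters with a small loop. This is a minimal sketch that assumes each onboarded cluster is reachable through a kubectl context and that the manifest was saved as pds-maintenance-job.yaml in the current directory. The function name `apply_maintenance_job`, the `KUBECTL` override, and the context names are hypothetical.

```shell
# Apply the maintenance job manifest to every kubectl context given as an
# argument. Set KUBECTL=echo to print the commands instead of running them.
apply_maintenance_job() {
  manifest="pds-maintenance-job.yaml"
  for ctx in "$@"; do
    echo "Applying $manifest to context $ctx"
    "${KUBECTL:-kubectl}" --context "$ctx" apply -f "$manifest" || return 1
  done
}

# Usage (context names are placeholders for your own clusters):
#   apply_maintenance_job cluster-a cluster-b
#   kubectl --context cluster-a get job pds-maintenance-job -n pds-system
```

The manifest sets `namespace: pds-system` on every resource, so no `-n` flag is needed when applying it.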
Post-maintenance recovery
If the PDS agent or operator pods remain in a crashed or failed state after maintenance, Portworx recommends restarting them with the following command:
kubectl rollout restart deploy -n pds-system
This action will restore connectivity with the control plane and ensure your agents and operators are functioning normally.
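To confirm that the restart finished, the rollout of each deployment can be awaited. This is a minimal sketch built on the restart command above. `restart_and_wait`, the `KUBECTL` override, and the 300-second timeout are illustrative choices, not values from Portworx.

```shell
# Restart every deployment in pds-system, then wait for each rollout to
# finish. Returns non-zero if the restart or any rollout fails or times out.
restart_and_wait() {
  ns="pds-system"
  k="${KUBECTL:-kubectl}"
  "$k" rollout restart deploy -n "$ns" || return 1
  # `-o name` prints one resource per line, e.g. deployment.apps/<name>
  for d in $("$k" get deploy -n "$ns" -o name); do
    "$k" rollout status "$d" -n "$ns" --timeout=300s || return 1
  done
}

# Usage (requires kubectl access to the target cluster):
#   restart_and_wait && echo "All PDS deployments rolled out"
```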
Following these steps mitigates disruptions caused by the temporary control plane unavailability and minimizes false alarms in your observability systems. If you experience any issues beyond the maintenance window, contact Portworx support for assistance.