Version: 123.3

PDS maintenance impact and recommended actions

Maintenance for PDS version 123.x is scheduled on November 29, 2024, from 2:30 AM to 2:30 PM UTC (November 28, 2024, from 6:30 PM to November 29, 2024, 6:30 AM PST).

During the scheduled maintenance window, the PDS production environment will be inaccessible. While data services will continue to operate normally, PDS operators and agents within your target clusters will temporarily lose connectivity with the control plane.

Impact

During the maintenance period, the PDS agent pod may enter a CrashLoopBackOff state.

Example:

NAME                                                      READY   STATUS             RESTARTS     AGE
pds-agent-fdd8c5cd-xsscd                                  0/1     CrashLoopBackOff   2 (3s ago)   7h28m
pds-backup-controller-manager-5b45cb45d4-mmxcd            2/2     Running            0            7h28m
pds-deployment-controller-manager-8687ff7b4f-fnvxk        2/2     Running            0            7h28m
pds-external-dns-568bd6cfc6-ggkzv                         1/1     Running            0            7h28m
pds-mutator-65f86c5d9-m9htx                               1/1     Running            0            7h28m
pds-operator-target-controller-manager-5b658c9dc7-rt6zp   2/2     Running            0            7h28m
pds-tc-kube-state-metrics-cb4f8dd45-n9s4x                 1/1     Running            0            7h28m
pds-tc-prometheus-server-5554598694-nxt22                 2/2     Running            0            7h28m
pds-teleport-67d887bfdf-68mfz                             1/1     Running            0            7h28m

Once maintenance is complete, all PDS pods and resources should return to a healthy state and successfully reconnect to the cloud service. If issues persist, refer to the post-maintenance recovery steps for guidance.
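To confirm which pods are affected during or after the window, you can filter the pod listing for anything that is not healthy. The following is a minimal sketch, not an official PDS tool: the helper name `unhealthy_pods` is hypothetical, and it assumes the default table output of `kubectl get pods` shown in the example above.

```shell
# Print the names of pods whose STATUS column is neither Running nor Completed.
# Reads the default table output of `kubectl get pods` on stdin.
# `unhealthy_pods` is a hypothetical helper name, not part of PDS.
unhealthy_pods() {
  # NR > 1 skips the header row; $1 is NAME and $3 is STATUS.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}
```

Example usage: `kubectl get pods -n pds-system | unhealthy_pods`. An empty result means that every listed pod reports Running or Completed.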

Recommended actions

The temporary pod failures will not affect the uptime or functionality of your data services. However, the following recommendations will help manage any observability alerts and monitoring disruptions caused by these failure events.

Option 1: Suppress alerts and ignore pod failures

Since pod failure events may trigger false alarms in your monitoring systems, Portworx recommends pausing or configuring your monitoring and observability solutions to ignore PDS pod failures during the maintenance period. Once the maintenance concludes and the control plane is accessible, the pods should recover automatically.
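If your alerting runs through custom scripts, one way to ignore PDS pod failures is to check whether the current time falls inside the announced window before raising an alert. The sketch below assumes the window stated above (November 29, 2024, 2:30 AM to 2:30 PM UTC) expressed as Unix epoch seconds. The variable and function names are hypothetical, and the script is not tied to any particular monitoring product.

```shell
# Maintenance window boundaries as Unix epoch seconds (UTC).
WINDOW_START=1732847400   # 2024-11-29 02:30:00 UTC
WINDOW_END=1732890600     # 2024-11-29 14:30:00 UTC

# Succeed (exit status 0) if the given epoch time falls inside the window.
# With no argument, the current time is used.
# `in_maintenance_window` is a hypothetical helper name.
in_maintenance_window() {
  now="${1:-$(date -u +%s)}"
  [ "$now" -ge "$WINDOW_START" ] && [ "$now" -lt "$WINDOW_END" ]
}
```

An alert handler could then skip PDS pod alerts with a guard such as `in_maintenance_window && exit 0`.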

Option 2: Use the maintenance job to prevent alerts

If you cannot suppress alerts in your environment, Portworx provides a Kubernetes Job that temporarily scales down the PDS agents and operators. The job monitors the health of the control plane and automatically brings the operators and agents back online once maintenance is complete.

Download the job file from https://pds.pure-px.io/maintenance-job.yaml, or copy the YAML below:

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: pds-system
  name: pds-maintenance-role
rules:
- apiGroups: ["apps"] # Deployments belong to the "apps" API group
  resources: ["deployments"]
  verbs: ["*"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: pds-system
  name: pds-maintenance-rolebinding
subjects:
- kind: ServiceAccount
  name: pds-maintenance-sa
  namespace: pds-system
roleRef:
  kind: Role
  name: pds-maintenance-role
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: pds-system
  name: pds-maintenance-sa
---
apiVersion: batch/v1
kind: Job
metadata:
  namespace: pds-system
  name: pds-maintenance-job
spec:
  template:
    spec:
      serviceAccountName: pds-maintenance-sa
      containers:
      - name:  pds-maintenance-job
        image: portworx/pds-agent:maintenance-job
        imagePullPolicy: Always
      restartPolicy: Never

Save the YAML above as pds-maintenance-job.yaml and apply it to each cluster that you have onboarded to PDS.
(Optional) If you use a custom registry, pull the portworx/pds-agent:maintenance-job image in advance to avoid interruptions.
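Applying the file to several clusters can be scripted per kubeconfig context. The following is a sketch that assumes each onboarded cluster is reachable through a named context in your kubeconfig. The helper name `apply_cmds` and the context names in the usage note are hypothetical. The helper only prints the commands so that you can review them before running anything.

```shell
# Print one `kubectl apply` command per kubeconfig context passed as an argument.
# `apply_cmds` is a hypothetical helper; it prints commands and does not run them.
apply_cmds() {
  for ctx in "$@"; do
    printf 'kubectl --context %s apply -f pds-maintenance-job.yaml\n' "$ctx"
  done
}
```

For example, `apply_cmds cluster-a cluster-b` prints two commands. After checking them, you can pipe the output to `sh` to run them.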

Post-maintenance recovery

If the PDS agent or operator pods remain in a crashed or failed state after maintenance is complete, Portworx recommends restarting them with the following command:

kubectl rollout restart deploy -n pds-system

This action will restore connectivity with the control plane and ensure your agents and operators are functioning normally.
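To confirm that the restart has finished, you can wait for each deployment to complete its rollout. The sketch below wraps the restart command above together with `kubectl rollout status`. The function name `restart_and_wait` and the default 300-second timeout are assumptions for illustration, not values taken from PDS.

```shell
# Restart every deployment in a namespace, then wait for each rollout to finish.
# `restart_and_wait` is a hypothetical helper; the defaults are illustrative.
restart_and_wait() {
  ns="${1:-pds-system}"
  timeout="${2:-300s}"
  kubectl rollout restart deploy -n "$ns" || return 1
  # `-o name` yields identifiers such as deployment.apps/<name>, one per line.
  for d in $(kubectl get deploy -n "$ns" -o name); do
    kubectl rollout status "$d" -n "$ns" --timeout="$timeout" || return 1
  done
}
```

Calling `restart_and_wait` with no arguments targets the pds-system namespace and returns a non-zero status if any rollout fails to complete within the timeout.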

Following these steps mitigates disruptions caused by the temporary control plane unavailability and minimizes false alarms in your observability systems. If you experience any issues beyond the maintenance window, contact Portworx support for assistance.
