Version: 123.3

PDS maintenance impact and recommended actions

Maintenance for PDS version 123.x is scheduled on November 29, 2024, from 2:30 AM to 2:30 PM UTC (November 28, 2024, from 6:30 PM to November 29, 2024, 6:30 AM PST).

During the scheduled maintenance window, the PDS production environment will be inaccessible. While data services will continue to operate normally, PDS operators and agents within your target clusters will temporarily lose connectivity with the control plane.

Impact

  • During the maintenance period, the PDS agent pod may enter a CrashLoopBackOff state.

    Example:

    NAME                                                      READY   STATUS             RESTARTS     AGE
    pds-agent-fdd8c5cd-xsscd                                  0/1     CrashLoopBackOff   2 (3s ago)   7h28m
    pds-backup-controller-manager-5b45cb45d4-mmxcd            2/2     Running            0            7h28m
    pds-deployment-controller-manager-8687ff7b4f-fnvxk        2/2     Running            0            7h28m
    pds-external-dns-568bd6cfc6-ggkzv                         1/1     Running            0            7h28m
    pds-mutator-65f86c5d9-m9htx                               1/1     Running            0            7h28m
    pds-operator-target-controller-manager-5b658c9dc7-rt6zp   2/2     Running            0            7h28m
    pds-tc-kube-state-metrics-cb4f8dd45-n9s4x                 1/1     Running            0            7h28m
    pds-tc-prometheus-server-5554598694-nxt22                 2/2     Running            0            7h28m
    pds-teleport-67d887bfdf-68mfz                             1/1     Running            0            7h28m
  • Once maintenance is complete, you should verify that all PDS pods and resources are running and successfully connected to the cloud service.
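One way to run the post-maintenance check is to list only the pods that are not fully ready. The sketch below is an illustration, not an official PDS tool: the helper name `pds_unhealthy_pods` is hypothetical, and it assumes the PDS components run in the `pds-system` namespace, as in the example output above.

```shell
# Print NAME, READY, and STATUS for every pod in pds-system that is not healthy.
# Assumption: PDS components are installed in the pds-system namespace.
pds_unhealthy_pods() {
  kubectl get pods -n pds-system --no-headers | awk '
    {
      split($2, ready, "/")   # READY column, for example 0/1
      # Skip finished job pods; report anything not Running or not fully ready.
      if ($3 != "Completed" && (ready[1] != ready[2] || $3 != "Running"))
        print $1, $2, $3
    }'
}
```

Calling `pds_unhealthy_pods` after the maintenance window prints nothing when every pod is running and ready.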

The temporary pod failures will not affect the uptime or functionality of your data services. However, the following recommendations will help manage any observability alerts and monitoring disruptions caused by these failure events.

Option 1: Suppress alerts and ignore pod failures

Since pod failure events may trigger false alarms in your monitoring systems, Portworx recommends pausing or configuring your monitoring and observability solutions to ignore PDS pod failures during the maintenance period. Once the maintenance concludes and the control plane is accessible, the pods should recover automatically.
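How you pause alerts depends entirely on your monitoring stack. As one hedged illustration, the sketch below assumes Prometheus Alertmanager with its `amtool` CLI, and assumes your alerts carry a `namespace` label; neither is stated by this page. The helper name `silence_pds_alerts` is hypothetical, and the start and end times are the UTC maintenance window given above.

```shell
# Create an Alertmanager silence covering the maintenance window.
# Assumptions: Alertmanager is your alerting backend, amtool is installed,
# and PDS alerts are labelled with namespace=pds-system.
# $1: base URL of your Alertmanager instance.
silence_pds_alerts() {
  amtool silence add \
    --alertmanager.url="$1" \
    --start="2024-11-29T02:30:00Z" \
    --end="2024-11-29T14:30:00Z" \
    --comment="PDS control plane maintenance" \
    namespace=pds-system
}
```

If you use a different alerting backend, the equivalent is any time-boxed mute rule scoped to the PDS namespace.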

Option 2: Use the maintenance job to prevent alerts

If you cannot suppress alerts in your environment, we provide a Kubernetes job that temporarily scales down the PDS agents and operators. The job monitors the health of the control plane and automatically brings the operators and agents back online once maintenance is complete.

  1. Download the job file from https://pds.pure-px.io/maintenance-job.yaml, or copy the YAML below:

    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      namespace: pds-system
      name: pds-maintenance-role
    rules:
    - apiGroups: ["apps"] # Deployments belong to the "apps" API group
      resources: ["deployments"]
      verbs: ["*"]
    ---
    kind: RoleBinding
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      namespace: pds-system
      name: pds-maintenance-rolebinding
    subjects:
    - kind: ServiceAccount
      name: pds-maintenance-sa
      namespace: pds-system
    roleRef:
      kind: Role
      name: pds-maintenance-role
      apiGroup: rbac.authorization.k8s.io
    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      namespace: pds-system
      name: pds-maintenance-sa
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      namespace: pds-system
      name: pds-maintenance-job
    spec:
      template:
        spec:
          serviceAccountName: pds-maintenance-sa
          containers:
          - name: pds-maintenance-job
            image: portworx/pds-agent:maintenance-job
            imagePullPolicy: Always
          restartPolicy: Never

  2. Save the YAML above as pds-maintenance-job.yaml and apply it to each cluster that you have onboarded to PDS.

  3. (Optional) If your clusters use a custom registry, make the portworx/pds-agent:maintenance-job image available in that registry ahead of the maintenance window so the job can start without interruption.
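Applying the file to several clusters can be scripted. The sketch below assumes each onboarded cluster is reachable through a context in your kubeconfig and that pds-maintenance-job.yaml is in the current directory; the helper name `apply_pds_maintenance_job` and the context names in the usage line are hypothetical.

```shell
# Apply the maintenance job to every cluster passed as a kubectl context name.
# Assumption: each onboarded cluster has a matching context in your kubeconfig.
apply_pds_maintenance_job() {
  for ctx in "$@"; do
    echo "Applying pds-maintenance-job.yaml to context: $ctx"
    # Stop at the first failure so a broken cluster is not silently skipped.
    kubectl --context "$ctx" apply -f pds-maintenance-job.yaml || return 1
  done
}
```

For example, `apply_pds_maintenance_job cluster-a cluster-b` applies the job to two contexts in turn. You can then confirm that the job exists on a cluster with `kubectl get job pds-maintenance-job -n pds-system`.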

Post-maintenance recovery

If the PDS agent or operator pods remain in a crashed or failed state after maintenance, Portworx recommends restarting them with the following command to restore them to a healthy state:

kubectl rollout restart deploy -n pds-system

This action will restore connectivity with the control plane and ensure your agents and operators are functioning normally.
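To confirm that the restart has finished, you can wait for each deployment's rollout to complete. This is a minimal sketch rather than an official procedure: the helper name `pds_wait_for_rollouts` is hypothetical and the five-minute timeout is an arbitrary choice.

```shell
# Wait for every deployment in pds-system to finish rolling out.
# Assumption: a 5-minute timeout per deployment is acceptable in your environment.
pds_wait_for_rollouts() {
  for d in $(kubectl get deploy -n pds-system -o name); do
    # $d has the form deployment.apps/<name>
    kubectl rollout status "$d" -n pds-system --timeout=5m || return 1
  done
}
```

The function returns a non-zero status as soon as one rollout fails or times out, which makes it usable in a scripted post-maintenance check.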

These steps mitigate disruptions caused by the temporary control plane unavailability and minimize false alarms in your observability systems. If you experience any issues after the maintenance window ends, contact Portworx support for assistance.
