Configure stork-controller-config ConfigMap parameters
Clusters that contain a large number of Kubernetes resources can span a broad spectrum of resource and system configurations. To make the solution viable across this wide range of configurations, users can alter the ConfigMap parameters.
Add the parameters specified in the table below to the stork-controller-config ConfigMap in the kube-system namespace, and alter the values as required to suit your configuration:
| Key/Parameter | Default Value | Description |
|---|---|---|
| large-resource-size-limit | 1 MB | Sets the size limit to adapt to the cluster-wide setting of etcd's message size if your cluster has modified the default etcd message size of 1.5 MB. The default value of 1 MB is derived by subtracting approximately 500 KB (reserved for etcd headers and overhead) from the default etcd size limit of 1.5 MB. If your cluster's etcd size limit has been modified, adjust this value by subtracting approximately 500 KB from your cluster's configured etcd size limit. |
| resource-count-limit | 500 | Sets the maximum number of resources that will be grouped together for upload, regardless of whether the size limit is reached. Use this parameter to reduce the number of Kubernetes API calls by limiting uploads to a single large resource group. |
| restore-volume-backup-count | 25 | Sets the number of volumes that will be restored in a single batch. If the restore process fails with a "device busy" error, reduce this value below 25. |
| restore-volume-sleep-interval | 20s | Sets the time interval between two batches of volumes that will be restored. Increase this value to allow more time for the backend storage to process each batch before the next one begins. |
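As a sketch, these parameters can be set as data entries in the stork-controller-config ConfigMap. The values below are illustrative defaults, not recommendations; large-resource-size-limit is expressed in bytes:

```yaml
# Illustrative stork-controller-config ConfigMap; the values shown are
# examples only. Tune them to suit your cluster.
apiVersion: v1
kind: ConfigMap
metadata:
  name: stork-controller-config
  namespace: kube-system
data:
  large-resource-size-limit: "1048576"    # in bytes (~1 MB)
  resource-count-limit: "500"
  restore-volume-backup-count: "25"
  restore-volume-sleep-interval: "20s"
```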
The behavior of these parameters is explained below:
- Large-resource-size-limit: If your cluster's etcd message size is configured to be smaller than the default value of 1.5 MB, alter this parameter's value to adapt to that cluster-wide setting. Specify an appropriate value in bytes.
- Resource-count-limit: If the number of resources overloads the Kubernetes API server, you may see the following errors in the stork log, and the backup operation can eventually time out:

  ```
  time="2023-04-22T04:22:49Z" level=debug msg="Monitoring storage nodes"
  time="2023-04-22T04:23:55Z" level=warning msg="gatherResourceInChunks: failed to list resources"
  time="2023-04-22T04:23:55Z" level=error msg="Error getting resources: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
  time="2023-04-22T04:23:55Z" level=error msg="Error backing up resources: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
  time="2023-04-22T04:23:55Z" level=error msg="Error backing up volumes: the server was unable to return a response in the time allotted, but may still be processing the request" ApplicationBackupName=<application-backup-name> ApplicationBackupUID=<application-backup-uid> Namespace=<namespace-name> ResourceVersion=<resource-version>
  ```

  To troubleshoot this scenario, reduce the default value of 500 resource queries at a time to a smaller number, such as 200 or 300.
- Restore-volume-backup-count: This configuration parameter defines the number of volumes that will be restored in a single batch. When the restore process fails with a device busy error, one probable cause is that too large a batch of PVCs was supplied to the restore process, causing the backend storage system to fail with the device busy error. Here is the sample error message displayed in the web console window for this scenario:

  ```
  Restore failed for volume: cloudsnap Restore id:<restore_id> for <backup-name> did not succeed: [createRestoreDestinationVol, Failed to create restore vol err:Volume (Name: <pvc-name>)] create failed error: Volume is busy on Node-not-assigned, processingNode <node-name>]
  ```

  As a troubleshooting measure, alter the default value of this parameter to a value below 25.
- Restore-volume-sleep-interval: This parameter sets the time interval between two batches of volume restores. Increase the default value to allow more time between two batches of restore.
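As an illustration of the sizing guidance for large-resource-size-limit, the byte value can be derived from your cluster's etcd message-size limit by subtracting the approximately 500 KB reserved for etcd headers and overhead. The 2 MB etcd limit below is a hypothetical example:

```shell
# Hypothetical sizing calculation for large-resource-size-limit (bytes).
# Assumes the cluster's etcd message-size limit was raised to 2 MB.
ETCD_LIMIT_BYTES=$((2 * 1024 * 1024))   # cluster's configured etcd limit
OVERHEAD_BYTES=$((500 * 1024))          # ~500 KB reserved for etcd overhead
echo $((ETCD_LIMIT_BYTES - OVERHEAD_BYTES))   # value to set, in bytes
```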
Large resource NFS backups and restores
KDMP job pods consume increased amounts of memory during large resource backup and restore operations to NFS backup locations. As a result, you may see out-of-memory alerts or failures of the NFS job pods that run on each target cluster. In these scenarios, Portworx by Pure Storage recommends increasing the CPU and memory limits by adding the following parameters to the kdmp-config ConfigMap, which resides in the kube-system namespace on the target cluster:
- KDMP_NFSEXECUTOR_REQUEST_CPU
- KDMP_NFSEXECUTOR_LIMIT_CPU
- KDMP_NFSEXECUTOR_REQUEST_MEMORY
- KDMP_NFSEXECUTOR_LIMIT_MEMORY
For more information on these parameters and how to configure them, see kdmp-config ConfigMap parameters.
Note that these keys are not present in the kdmp-config ConfigMap by default. When you edit the ConfigMap with the kubectl command, refer to the usage guidance on the kdmp-config ConfigMap parameters page to set these parameters. In this scenario, the CPU and memory limit parameters are set to double their default values.
For example, consider a cluster with 4 nodes and 50,000 resources composed of ConfigMap and Secret resource types. The maximum memory limit (KDMP_NFSEXECUTOR_LIMIT_MEMORY) required to back up and restore data in such an environment is approximately 3Gi. Note that this is an approximate value; actual memory usage may vary depending on your environment and configuration.
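As a sketch, the NFS executor overrides could be added to the kdmp-config ConfigMap as shown below. The CPU and memory values are illustrative only (the 3Gi memory limit matches the example environment above); tune them to your workload:

```yaml
# Illustrative kdmp-config ConfigMap entries for the NFS executor.
# The resource values are examples, not recommended defaults.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kdmp-config
  namespace: kube-system
data:
  KDMP_NFSEXECUTOR_REQUEST_CPU: "500m"
  KDMP_NFSEXECUTOR_LIMIT_CPU: "1"
  KDMP_NFSEXECUTOR_REQUEST_MEMORY: "1Gi"
  KDMP_NFSEXECUTOR_LIMIT_MEMORY: "3Gi"
```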