Looking for a way to recover your awx-operator helm install of AWX after upgrading from version 2.12.x and lower to 2.13.0 and higher, only to find out that the postgres 15 pod can't start and the entire service is down? Look no further.
Step 1:
First, let's make sure our postgres database is safe if anything strange happens. Make sure that the PV (persistent volume) has the following under spec, by editing the PV yaml. This can be changed at any time without risk of downtime.
kubectl edit pv pvc-<your-pv-id>
spec:
  persistentVolumeReclaimPolicy: Retain
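If you would rather not open an editor, the same change can be made with kubectl patch. This is a minimal sketch: the retain_pv function name and the KUBECTL override variable are hypothetical helpers made up for illustration, and the PV name is whatever your cluster assigned.

```shell
# Hypothetical helper: set a PV's reclaim policy to Retain without an editor.
# KUBECTL can be overridden (for example KUBECTL=echo) to preview the command.
KUBECTL="${KUBECTL:-kubectl}"

retain_pv() {
    # $1: name of the PersistentVolume (the pvc-<your-pv-id> value)
    pv_name="$1"
    $KUBECTL patch pv "$pv_name" \
        -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
}

# Usage: retain_pv pvc-<your-pv-id>
```

Afterwards, kubectl get pv should list Retain in the RECLAIM POLICY column for that volume.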
Let's also create an awx database backup. It can be used later if we need to restore our database to a known good state. Create a new file, backup.yaml:
apiVersion: awx.ansible.com/v1beta1
kind: AWXBackup
metadata:
  name: awxbackup-0-0-1
  namespace: awx
spec:
  deployment_name: awx
And apply it:
kubectl apply -f backup.yaml
Step 2:
Once the backup completes (check with kubectl describe awxbackup -n awx awxbackup-0-0-1), upgrade the awx-operator helm chart to the latest version. This will break awx while we perform the next few steps, so don't panic.
helm repo update
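helm upgrade awx-operator awx-operator/awx-operator --namespace awx
(The chart reference awx-operator/awx-operator assumes the chart repo was added under the name awx-operator; adjust it if yours differs.)
Also upgrade the CRDs: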
kubectl apply --server-side --force-conflicts -k github.com/ansible/awx-operator/config/crd
After the upgrade process finishes, and the postgres15 pod tries (and fails) to come online, proceed to the next step.
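Before touching the volume, it can help to confirm that the postgres15 pod really is the one failing. The sketch below is only an illustration: the show_postgres_state function and KUBECTL variable are hypothetical, and the pod name awx-postgres-15-0 is an assumption inferred from the PVC name postgres-15-awx-postgres-15-0, so adjust it to what kubectl get pods -n awx shows.

```shell
# Hypothetical helper: show the state and recent logs of the PostgreSQL 15 pod.
# Assumed defaults: namespace "awx", pod "awx-postgres-15-0" (inferred, not confirmed).
KUBECTL="${KUBECTL:-kubectl}"

show_postgres_state() {
    ns="${1:-awx}"
    pod="${2:-awx-postgres-15-0}"
    # Pod status and restart count at a glance
    $KUBECTL get pod -n "$ns" "$pod"
    # Last lines logged by the previous (crashed) container instance, if any
    $KUBECTL logs -n "$ns" "$pod" --previous --tail=20
}

# Usage: show_postgres_state awx awx-postgres-15-0
```

If the pod has not restarted yet, the --previous logs call returns an error; drop that flag to read the current container's logs instead.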
Step 3:
Now we're going to create a temporary pod that mounts the awx postgres PVC so that we can fix the permission issues. Create a new yaml file (pvc.yaml) that we'll use to create the temporary pod. If the PVC created by awx-operator or your namespace is named differently, edit as needed. Even if your storage backing is ReadWriteOnce, this should still succeed, as the pod will start up between postgres15 pod crashloopbackoffs:
apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
  namespace: awx
spec:
  containers:
  - image: busybox
    name: pvc-inspector
    command: ["tail"]
    args: ["-f", "/dev/null"]
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount
  volumes:
  - name: pvc-mount
    persistentVolumeClaim:
      claimName: postgres-15-awx-postgres-15-0
Step 4:
Create the pod based on the yaml file created in the previous step:
kubectl apply -f pvc.yaml
Step 5:
Exec into the temporary pod and change the permission settings for the postgres database:
kubectl exec -it -n awx pvc-inspector -- /bin/sh
> chown -R 26:26 /pvc/data/
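To double-check the result before tearing the pod down, the ownership can be read back from your workstation. This is a rough sketch under the same assumptions as the steps above (namespace awx, pod pvc-inspector, data under /pvc/data); check_pg_ownership and KUBECTL are hypothetical names used only for illustration.

```shell
# Hypothetical helper: print the numeric owner of the postgres data directory.
KUBECTL="${KUBECTL:-kubectl}"

check_pg_ownership() {
    ns="${1:-awx}"
    # -d lists the directory itself, -n prints numeric UID/GID instead of names
    $KUBECTL exec -n "$ns" pvc-inspector -- ls -ldn /pvc/data
}

# Usage: check_pg_ownership awx
```

The owner and group columns should both read 26 once the chown has been applied.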
Step 6:
Tear down the temporary pod:
kubectl delete pod -n awx pvc-inspector
Step 7:
Success! AWX should now be up and running on postgres15 with the latest helm chart release. The migration/upgrade process may take up to 5 minutes to complete, so please be patient.
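Rather than refreshing by hand, you could let kubectl wait for the database to come up. This is a sketch, not a definitive check: the StatefulSet name awx-postgres-15 is an assumption inferred from the PVC name, and wait_for_awx and KUBECTL are hypothetical helper names.

```shell
# Hypothetical helper: wait for PostgreSQL 15, then list the AWX pods.
# Assumed StatefulSet name: awx-postgres-15 (inferred, verify with kubectl get sts -n awx).
KUBECTL="${KUBECTL:-kubectl}"

wait_for_awx() {
    ns="${1:-awx}"
    # Block until the StatefulSet reports ready, or give up after 5 minutes
    $KUBECTL rollout status statefulset/awx-postgres-15 -n "$ns" --timeout=5m
    # List all pods so the web and task pods can be checked as well
    $KUBECTL get pods -n "$ns"
}

# Usage: wait_for_awx awx
```

The 5 minute timeout mirrors the wait mentioned above; raise it if your migration is known to be slow.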
Troubleshooting:
If for some reason it's been 10+ minutes and the migration/upgrade process continues to fail, you can try to restore the database backup on top of the failed postgres15 database, assuming the postgres15 pod is online.
Create a new file, restore.yaml:
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore1
  namespace: awx
spec:
  deployment_name: awx
  backup_name: awxbackup-0-0-1
And apply it:
kubectl apply -f restore.yaml
Emergency rescue:
If all else fails, we can fall back to the old database. First, let's tear down the non-working awx instance:
helm uninstall -n awx awx-operator
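Now let's re-install awx (the chart reference awx-operator/awx-operator assumes the chart repo was added under the name awx-operator):
helm install -n awx awx-operator awx-operator/awx-operator --version 2.12.2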
Be sure to edit the values.yaml to set:
AWX:
  enabled: true
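If you don't keep a values.yaml around, helm's --set flag can pass the same setting on the command line. Treat this as a sketch: reinstall_operator and HELM are hypothetical names, and the chart reference awx-operator/awx-operator assumes the repo alias awx-operator.

```shell
# Hypothetical helper: reinstall the 2.12.2 chart with the AWX instance enabled.
# HELM can be overridden (for example HELM=echo) to preview the command.
HELM="${HELM:-helm}"

reinstall_operator() {
    ns="${1:-awx}"
    # --set AWX.enabled=true is the command-line form of the values.yaml setting
    $HELM install -n "$ns" awx-operator awx-operator/awx-operator \
        --version 2.12.2 --set AWX.enabled=true
}

# Usage: reinstall_operator awx
```

Any other customizations you had in values.yaml would still need to be passed as well, either with -f values.yaml or further --set flags.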
Wait for awx to come fully online. With any luck it will pick up the retained postgres13 database and things will be back online after several minutes. If it's still not online after ~10 minutes, you can then restore the backup with an AWXRestore resource.
Create a new file, restore.yaml:
apiVersion: awx.ansible.com/v1beta1
kind: AWXRestore
metadata:
  name: restore1
  namespace: awx
spec:
  deployment_name: awx
  backup_name: awxbackup-0-0-1
And apply it:
kubectl apply -f restore.yaml