Troubleshooting

Terraform State Lock

Symptom: Error acquiring the state lock

# Check who holds the lock
aws dynamodb scan --table-name <lock-table>

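# Force unlock (use with caution: only when no other run is still in progress)
# The lock ID is printed in the "Lock Info" block of the error message
tofu force-unlock <lock-id>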
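A full table scan can be noisy. The following is a minimal sketch for printing only the lock metadata. It assumes the lock table uses the S3 backend's usual DynamoDB layout, where an item that holds a lock carries an `Info` string attribute containing JSON (lock ID, holder, creation time). The function name `show_state_locks` is hypothetical.

```shell
# Print the lock metadata of every item that currently holds a lock.
# Assumption: lock items have an "Info" string attribute (S3 backend layout).
show_state_locks() {
  # $1: name of the DynamoDB lock table
  aws dynamodb scan \
    --table-name "$1" \
    --filter-expression 'attribute_exists(#i)' \
    --expression-attribute-names '{"#i":"Info"}' \
    --query 'Items[].Info.S' \
    --output text
}

# Usage: show_state_locks <lock-table>
```

If the output is empty, no lock is currently recorded in the table, and the error most likely comes from a run that has already finished or from a different table.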

EFS Mount Failures

Symptom: Pods stuck in ContainerCreating with EFS mount errors

  1. Verify EFS CSI driver is installed:

    kubectl get pods -n kube-system | grep efs
    
  2. Check security groups allow NFS (port 2049) from EKS nodes

  3. Verify EFS mount targets exist in the correct subnets
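Steps 2 and 3 can be checked from the AWS CLI. This is a sketch rather than a definitive procedure; the function names are hypothetical and the file system and mount target IDs have to come from your own environment.

```shell
# List each mount target with its subnet, availability zone, and state.
list_efs_mount_targets() {
  # $1: EFS file system ID
  aws efs describe-mount-targets \
    --file-system-id "$1" \
    --query 'MountTargets[].[MountTargetId,SubnetId,AvailabilityZoneName,LifeCycleState]' \
    --output table
}

# Show the security groups attached to one mount target.
list_mount_target_security_groups() {
  # $1: mount target ID
  aws efs describe-mount-target-security-groups \
    --mount-target-id "$1" \
    --query 'SecurityGroups' \
    --output text
}

# Usage:
#   list_efs_mount_targets <file-system-id>
#   list_mount_target_security_groups <mount-target-id>
```

Each subnet that hosts EKS nodes should have a mount target in the `available` state, and the security groups it lists should permit inbound TCP 2049 from the nodes.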

PVC Stuck in Pending

Symptom: PersistentVolumeClaim is not bound

kubectl describe pvc <pvc-name>

Common causes:

  • StorageClass not found: ensure efsStorageClass.enabled: true

  • EFS CSI driver not running

  • Wrong fileSystemId in StorageClass
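To rule out the last cause, compare the `fileSystemId` parameter on the StorageClass with the ID of the file system you expect. The sketch below assumes a StorageClass provisioned by the EFS CSI driver; the function names and the StorageClass name in the usage line are hypothetical.

```shell
# Print the fileSystemId parameter configured on a StorageClass.
storageclass_fs_id() {
  # $1: StorageClass name
  kubectl get storageclass "$1" -o jsonpath='{.parameters.fileSystemId}'
}

# Succeed only if the StorageClass points at the expected EFS file system.
check_storageclass_fs_id() {
  # $1: StorageClass name, $2: expected file system ID
  actual="$(storageclass_fs_id "$1")"
  if [ "$actual" = "$2" ]; then
    echo "OK: $1 uses $2"
  else
    echo "MISMATCH: $1 uses '${actual}', expected '$2'"
    return 1
  fi
}

# Usage: check_storageclass_fs_id <storageclass-name> <file-system-id>
```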

Image Pull Errors

Symptom: ImagePullBackOff or ErrImagePull

For ECR:

  1. Verify node IAM role has AmazonEC2ContainerRegistryReadOnly

  2. Check images exist (repo names use the Terraform prefix, e.g. ubtrace-production/ub-backend):

    # List all ECR repos to find the exact names
    aws ecr describe-repositories --query 'repositories[].repositoryName' --output table
    
    # Then check images in a specific repo
    aws ecr describe-images --repository-name ubtrace-<env>/ub-backend
    
  3. Verify region matches between ECR and EKS
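For step 3, the region is embedded in the ECR image URI, so it can be compared with the region label on the nodes. This sketch assumes the standard ECR URI form `<account-id>.dkr.ecr.<region>.amazonaws.com/<repo>:<tag>` and that nodes carry the well-known `topology.kubernetes.io/region` label; both function names are hypothetical.

```shell
# Extract the region from an ECR image URI.
# Prints nothing if the URI does not look like an ECR URI.
ecr_region_of_image() {
  # $1: full image reference as it appears in the pod spec
  printf '%s\n' "$1" \
    | sed -n 's/^[0-9]*\.dkr\.ecr\.\([a-z0-9-]*\)\.amazonaws\.com.*/\1/p'
}

# Print the region label of the first node in the cluster.
node_region() {
  kubectl get nodes \
    -o jsonpath='{.items[0].metadata.labels.topology\.kubernetes\.io/region}'
}

# Usage:
#   ecr_region_of_image <image-uri>
#   node_region
```

Cross-region pulls can work when IAM and networking allow them, so a mismatch is a hint to investigate rather than proof of the cause.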

For private registries:

  1. Verify imagePullSecrets is configured in Helm values

  2. Check secret exists: kubectl get secret <secret-name>

RDS Connection Refused

Symptom: API pods crash with database connection errors

  1. Verify security groups allow PostgreSQL (port 5432) from EKS nodes

  2. Check credentials in SSM (--with-decryption is needed to read a SecureString value): aws ssm get-parameter --with-decryption --name "/ubtrace/<env>/db/app/password"

  3. Test connectivity: kubectl run pg-test --rm -it --image=postgres:16-alpine -- psql <connection-string>
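When the `psql` test fails, it helps to know whether the problem is the network path or the credentials. One way to separate the two, sketched here under the assumption that the `postgres:16-alpine` image is pullable from the cluster, is to run `pg_isready`, which probes the server without authenticating. The function name `pg_reachable` is hypothetical.

```shell
# Probe the database endpoint from inside the cluster without logging in.
pg_reachable() {
  # $1: RDS endpoint hostname
  kubectl run pg-test --rm -it --restart=Never --image=postgres:16-alpine -- \
    pg_isready -h "$1" -p 5432
}

# Usage: pg_reachable <rds-endpoint>
```

If `pg_isready` reports that the server is accepting connections but `psql` still fails, look at the credentials from step 2 rather than at the security groups.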

Redis Auth / TLS Failures

Symptom: NOAUTH Authentication required or TLS handshake errors

  1. ElastiCache with auth token requires TLS. Ensure redis.external.tls: true

  2. Verify auth token matches SSM value

  3. Test: kubectl run redis-test --rm -it --image=redis:7-alpine -- redis-cli -h <host> --tls -a <token> ping

Keycloak Redirect Loops

Symptom: Browser redirects in a loop after login

  1. Verify keycloak.hostname matches the public URL exactly

  2. Check oidc.issuer includes /realms/ubtrace

  3. Ensure ALB health check path is /health/ready (not /)

  4. Verify the cookie domain matches the public hostname (the browser drops cookies set for a different domain, which restarts the login)
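Steps 1 and 2 can be cross-checked against what Keycloak itself advertises. The sketch below reads the `issuer` field from the realm's OIDC discovery document; the value should equal `oidc.issuer` character for character, including scheme and any trailing path. The function name `keycloak_issuer` is hypothetical, and the `sed` extraction is a deliberately simple stand-in for a JSON parser.

```shell
# Print the issuer that Keycloak advertises for the ubtrace realm.
keycloak_issuer() {
  # $1: public base URL of Keycloak, without a trailing slash
  curl -fsS "$1/realms/ubtrace/.well-known/openid-configuration" \
    | sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage: keycloak_issuer https://<keycloak-hostname>
```

An issuer that shows an internal hostname or `http://` instead of the public `https://` URL usually points back to step 1.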