
Cluster and Pod Issues

The init containers wait for agent-controller to be ready. Check its logs:
kubectl logs -n xpander deployment/xpander-agent-controller
Common cause: Cannot reach the deployment manager. If using PrivateLink, verify the DNS and security groups are configured correctly — see Configure PrivateLink.
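A quick way to test DNS from inside the cluster is to resolve the deployment manager endpoint from a throwaway pod. A minimal sketch, assuming the AWS CLI-side setup is done; `<DEPLOYMENT_MANAGER_HOST>` is a placeholder for the endpoint in your configuration:

```shell
# Resolve the deployment manager hostname from inside the cluster.
# <DEPLOYMENT_MANAGER_HOST> is a placeholder for your configured endpoint.
kubectl -n xpander run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup <DEPLOYMENT_MANAGER_HOST>
```

If resolution fails, check the PrivateLink DNS configuration first; if the name resolves but connections hang, look at the security groups.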
Cause: StorageClass or PVC issues. Inspect the claims:
kubectl -n xpander get pvc
kubectl -n xpander describe pvc
Check that a default StorageClass exists and the EBS CSI driver is running:
kubectl get storageclass  # Should show a default (gp3)
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver  # Should show Running
If using EKS, ensure the node role has AmazonEBSCSIDriverPolicy attached.
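One way to verify, assuming the AWS CLI is configured; `<NODE_ROLE_NAME>` is a placeholder for your node group's IAM role:

```shell
# List managed policies attached to the node role.
# <NODE_ROLE_NAME> is a placeholder; find the role under the node group details.
aws iam list-attached-role-policies \
  --role-name <NODE_ROLE_NAME> \
  --query "AttachedPolicies[].PolicyName" --output text
```

AmazonEBSCSIDriverPolicy should appear in the output.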
xpander images are amd64-only and will not run on ARM/Graviton nodes (pods typically fail with an exec format error). Switch to x86 instance types (t3, m5, c5, etc.).
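You can confirm your nodes' architecture via the standard kubernetes.io/arch label:

```shell
# The ARCH column should read amd64; arm64 nodes cannot run the xpander images.
kubectl get nodes -L kubernetes.io/arch
```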
The agent-worker pod requests 2 CPU by default. Options:
  • Add more nodes or use larger instances
  • For non-production environments only, lower the CPU request:
kubectl patch deployment xpander-agent-worker -n xpander \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"500m"}]'
Check application logs:
kubectl -n xpander logs deployment/xpander-agent-controller
kubectl -n xpander logs deployment/xpander-ai-gateway
kubectl -n xpander logs deployment/xpander-agent-worker
kubectl -n xpander logs deployment/xpander-mcp

Ingress and Networking Issues

Verify ingress configuration:
kubectl -n xpander describe ingress
kubectl -n xpander get ingress
Check that the NLB was provisioned and DNS CNAME records point to it:
kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
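To confirm the CNAME records, compare a dig lookup against the NLB hostname printed above (app.chat.<DOMAIN> is a made-up example; test a subdomain your chat UI actually generated):

```shell
# Both names should ultimately resolve to the NLB hostname.
dig +short <DOMAIN>
dig +short app.chat.<DOMAIN>   # example per-thread subdomain
```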
If using EKS Auto Mode, ensure the cluster role trust policy includes sts:TagSession:
{
  "Effect": "Allow",
  "Principal": { "Service": "eks.amazonaws.com" },
  "Action": ["sts:AssumeRole", "sts:TagSession"]
}
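To verify the trust policy, dump the actions it grants (`<CLUSTER_ROLE_NAME>` is a placeholder for the EKS cluster IAM role):

```shell
# <CLUSTER_ROLE_NAME> is a placeholder for the EKS cluster IAM role name.
aws iam get-role --role-name <CLUSTER_ROLE_NAME> \
  --query "Role.AssumeRolePolicyDocument.Statement[].Action"
```

Both sts:AssumeRole and sts:TagSession should be listed.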
The ACM certificate must include *.chat.<DOMAIN> as a subject alternative name. The chat UI generates per-thread subdomains (e.g., moccasin-prawn.chat.<DOMAIN>) that are not covered by *.<DOMAIN>. Request a new certificate with:
aws acm request-certificate \
  --domain-name "<DOMAIN>" \
  --subject-alternative-names "*.<DOMAIN>" "*.chat.<DOMAIN>" \
  --validation-method DNS \
  --region <REGION> --profile <PROFILE>
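After validation, you can double-check the SANs on the issued certificate, using the ARN that request-certificate returned:

```shell
# <CERT_ARN> is the CertificateArn output of the request-certificate call.
aws acm describe-certificate --certificate-arn <CERT_ARN> \
  --region <REGION> --profile <PROFILE> \
  --query "Certificate.SubjectAlternativeNames"
```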

API Key Issues

The xpander-static secret has a Helm resource keep policy. Set the key directly:
kubectl patch secret xpander-static -n xpander --type=merge \
  -p "{\"data\":{\"ANTHROPIC_API_KEY\":\"$(echo -n '<YOUR_KEY>' | base64)\"}}"
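To confirm the patch landed before restarting, decode the stored value; it should print your key exactly, with no extra layer of encoding:

```shell
kubectl get secret xpander-static -n xpander \
  -o jsonpath='{.data.ANTHROPIC_API_KEY}' | base64 -d
```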

kubectl rollout restart deployment xpander-agent-worker -n xpander
See the secret field name mapping for all key names.
kubectl -n xpander exec deployment/xpander-agent-worker -- env | grep API_KEY

Debug Commands

# Get all resources
kubectl -n xpander get all

# Check events
kubectl -n xpander get events --sort-by=.metadata.creationTimestamp

# Describe problematic pods
kubectl -n xpander describe pod <pod-name>

# Check service endpoints
kubectl -n xpander get endpoints
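If a pod is crash-looping, its current log stream may be empty; the --previous flag retrieves output from the last terminated container:

```shell
# Logs from the previous (crashed) container instance
kubectl -n xpander logs <pod-name> --previous
```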