
Cluster and Pod Issues

The init containers wait for agent-controller to be ready. Check its logs:
kubectl logs -n xpander deployment/xpander-agent-controller
Common cause: Cannot reach the deployment manager. If using PrivateLink, verify the DNS and security groups are configured correctly — see Configure PrivateLink.
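A quick way to test DNS from inside the cluster is to resolve the deployment manager endpoint from a throwaway pod. A minimal sketch, assuming the AWS CLI-side setup is done; `<DEPLOYMENT_MANAGER_HOST>` is a placeholder for the endpoint in your configuration:

```shell
# Resolve the deployment manager hostname from inside the cluster.
# <DEPLOYMENT_MANAGER_HOST> is a placeholder for your configured endpoint.
kubectl -n xpander run dns-test --rm -it --restart=Never \
  --image=busybox:1.36 -- nslookup <DEPLOYMENT_MANAGER_HOST>
```

If resolution fails, check the PrivateLink DNS configuration first; if the name resolves but connections hang, look at the security groups.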
Cause: StorageClass or PVC issues. Inspect the claims:
kubectl -n xpander get pvc
kubectl -n xpander describe pvc
Check that a default StorageClass exists and the EBS CSI driver is running:
kubectl get storageclass  # Should show a default (gp3)
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver  # Should show Running
If using EKS, ensure the node role has AmazonEBSCSIDriverPolicy attached.
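One way to verify, assuming the AWS CLI is configured; `<NODE_ROLE_NAME>` is a placeholder for your node group's IAM role:

```shell
# List managed policies attached to the node role.
# <NODE_ROLE_NAME> is a placeholder; find the role under the node group details.
aws iam list-attached-role-policies \
  --role-name <NODE_ROLE_NAME> \
  --query "AttachedPolicies[].PolicyName" --output text
```

AmazonEBSCSIDriverPolicy should appear in the output.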
xpander images are amd64-only and will not run on ARM/Graviton nodes (pods typically fail with an exec format error). Switch to x86 instance types (t3, m5, c5, etc.).
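You can confirm your nodes' architecture via the standard kubernetes.io/arch label:

```shell
# The ARCH column should read amd64; arm64 nodes cannot run the xpander images.
kubectl get nodes -L kubernetes.io/arch
```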
The agent-worker pod requests 2 CPU by default. Options:
  • Add more nodes or use larger instances
  • For non-production environments only, lower the CPU request:
kubectl patch deployment xpander-agent-worker -n xpander \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"500m"}]'
Check application logs:
kubectl -n xpander logs deployment/xpander-agent-controller
kubectl -n xpander logs deployment/xpander-ai-gateway
kubectl -n xpander logs deployment/xpander-agent-worker
kubectl -n xpander logs deployment/xpander-mcp

Ingress and Networking Issues

Verify ingress configuration:
kubectl -n xpander describe ingress
kubectl -n xpander get ingress
Check that the NLB was provisioned and DNS CNAME records point to it:
kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
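To confirm the CNAME records, compare a dig lookup against the NLB hostname printed above (app.chat.<DOMAIN> is a made-up example; test a subdomain your chat UI actually generated):

```shell
# Both names should ultimately resolve to the NLB hostname.
dig +short <DOMAIN>
dig +short app.chat.<DOMAIN>   # example per-thread subdomain
```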
If using EKS Auto Mode, ensure the cluster role trust policy includes sts:TagSession:
{
  "Effect": "Allow",
  "Principal": { "Service": "eks.amazonaws.com" },
  "Action": ["sts:AssumeRole", "sts:TagSession"]
}
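To verify the trust policy, dump the actions it grants (`<CLUSTER_ROLE_NAME>` is a placeholder for the EKS cluster IAM role):

```shell
# <CLUSTER_ROLE_NAME> is a placeholder for the EKS cluster IAM role name.
aws iam get-role --role-name <CLUSTER_ROLE_NAME> \
  --query "Role.AssumeRolePolicyDocument.Statement[].Action"
```

Both sts:AssumeRole and sts:TagSession should be listed.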
The ACM certificate must include *.chat.<DOMAIN> as a subject alternative name. The chat UI generates per-thread subdomains (e.g., moccasin-prawn.chat.<DOMAIN>) that are not covered by *.<DOMAIN>. Request a new certificate with:
aws acm request-certificate \
  --domain-name "<DOMAIN>" \
  --subject-alternative-names "*.<DOMAIN>" "*.chat.<DOMAIN>" \
  --validation-method DNS \
  --region <REGION> --profile <PROFILE>
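After validation, you can double-check the SANs on the issued certificate, using the ARN that request-certificate returned:

```shell
# <CERT_ARN> is the CertificateArn output of the request-certificate call.
aws acm describe-certificate --certificate-arn <CERT_ARN> \
  --region <REGION> --profile <PROFILE> \
  --query "Certificate.SubjectAlternativeNames"
```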

API Key Issues

The xpander-static secret has a Helm resource keep policy. Set the key directly:
kubectl patch secret xpander-static -n xpander --type=merge \
  -p "{\"data\":{\"ANTHROPIC_API_KEY\":\"$(echo -n '<YOUR_KEY>' | base64)\"}}"
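To confirm the patch landed before restarting, decode the stored value; it should print your key exactly, with no extra layer of encoding:

```shell
kubectl get secret xpander-static -n xpander \
  -o jsonpath='{.data.ANTHROPIC_API_KEY}' | base64 -d
```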

kubectl rollout restart deployment xpander-agent-worker -n xpander
See the secret field name mapping for all key names.
kubectl -n xpander exec deployment/xpander-agent-worker -- env | grep API_KEY

Debug Commands

# Get all resources
kubectl -n xpander get all

# Check events
kubectl -n xpander get events --sort-by=.metadata.creationTimestamp

# Describe problematic pods
kubectl -n xpander describe pod <pod-name>

# Check service endpoints
kubectl -n xpander get endpoints
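If a pod is crash-looping, its current log stream may be empty; the --previous flag retrieves output from the last terminated container:

```shell
# Logs from the previous (crashed) container instance
kubectl -n xpander logs <pod-name> --previous
```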