# Troubleshooting

> Diagnose and resolve common issues with xpander.ai self-hosted deployments

## Cluster and Pod Issues

<AccordionGroup>
  <Accordion title="Pods stuck in Init">
    The init containers wait for `agent-controller` to be ready. Check its logs:

    ```bash theme={"dark"}
    kubectl logs -n xpander deployment/xpander-agent-controller
    ```

    **Common cause:** The controller cannot reach the deployment manager. If you use PrivateLink, verify that DNS and the security groups are configured correctly. See [Configure PrivateLink](/self-hosted/privatelink).
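
    To find which pods are affected before reading logs, the filter below is a minimal sketch. `pods_in_init` is a hypothetical helper name, and it assumes the default `kubectl get pods` column layout, where the third column is STATUS:

    ```bash
    # List pods whose STATUS column starts with "Init:" (for example Init:0/1).
    # Uses the "xpander" namespace from this page unless another one is passed.
    pods_in_init() {
      kubectl -n "${1:-xpander}" get pods --no-headers | awk '$3 ~ /^Init:/ {print $1}'
    }

    # Usage: pods_in_init
    ```

    For each pod it prints, `kubectl -n xpander describe pod <pod-name>` lists the init containers and their current state.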
  </Accordion>

  <Accordion title="Pods stuck in Pending">
    **Common cause:** A PersistentVolumeClaim cannot bind because of a StorageClass or PVC problem. Inspect the claims and their events:

    ```bash theme={"dark"}
    kubectl -n xpander get pvc
    kubectl -n xpander describe pvc
    ```

    Check that a default StorageClass exists and the EBS CSI driver is running:

    ```bash theme={"dark"}
    kubectl get storageclass  # Should show a default (gp3)
    kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-ebs-csi-driver  # Should show Running
    ```

    If using EKS, ensure the node role has `AmazonEBSCSIDriverPolicy` attached.
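
    The default StorageClass check can also be scripted. This is a sketch under the assumption that `kubectl get storageclass` marks the default class with `(default)` after its name, as current kubectl versions do; `default_storageclasses` is a hypothetical helper name:

    ```bash
    # Print the name of every StorageClass flagged "(default)".
    # No output means the cluster has no default class, so PVCs without an
    # explicit storageClassName stay Pending.
    default_storageclasses() {
      kubectl get storageclass --no-headers | awk '$2 == "(default)" {print $1}'
    }

    # Usage: default_storageclasses
    ```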
  </Accordion>

  <Accordion title="exec format error">
    This error means the pod was scheduled on an ARM (Graviton) node. xpander images are built for **amd64 only**, so switch the node group to x86 instance types (`t3`, `m5`, `c5`, etc.).
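
    To confirm which nodes are the problem, the sketch below reads each node's reported architecture from `.status.nodeInfo.architecture`. `non_amd64_nodes` is a hypothetical helper name:

    ```bash
    # Print every node whose architecture is not amd64, as "name (arch)".
    non_amd64_nodes() {
      kubectl get nodes \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}' |
        awk '$2 != "" && $2 != "amd64" {print $1 " (" $2 ")"}'
    }

    # Usage: non_amd64_nodes
    ```

    An empty result means every node reports `amd64`.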
  </Accordion>

  <Accordion title="Insufficient CPU">
    The `agent-worker` pod requests 2 CPU by default. Options:

    * Add more nodes or use larger instances
    * For non-production environments only, lower the CPU request:

    ```bash theme={"dark"}
    kubectl patch deployment xpander-agent-worker -n xpander \
      --type='json' \
      -p='[{"op":"replace","path":"/spec/template/spec/containers/0/resources/requests/cpu","value":"500m"}]'
    ```
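
    To compare what the worker requests with what the nodes can offer, the following sketch reads both values. The helper names are hypothetical, and the container index `0` mirrors the patch path above:

    ```bash
    # CPU request currently set on the first container of the worker deployment.
    worker_cpu_request() {
      kubectl -n xpander get deployment xpander-agent-worker \
        -o jsonpath='{.spec.template.spec.containers[0].resources.requests.cpu}'
    }

    # Allocatable CPU per node, one "name<TAB>cpu" line per node.
    node_allocatable_cpu() {
      kubectl get nodes \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}'
    }

    # Usage: worker_cpu_request; echo; node_allocatable_cpu
    ```

    Allocatable CPU is the node total before other pods are counted, so a node can still be too full even when its allocatable value exceeds the request.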
  </Accordion>

  <Accordion title="Health checks failing">
    Check application logs:

    ```bash theme={"dark"}
    kubectl -n xpander logs deployment/xpander-agent-controller
    kubectl -n xpander logs deployment/xpander-ai-gateway
    kubectl -n xpander logs deployment/xpander-agent-worker
    kubectl -n xpander logs deployment/xpander-mcp
    ```
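
    The four commands above can be wrapped in a loop that limits output to recent lines. This is a minimal sketch; `tail_xpander_logs` is a hypothetical helper name and the default of 50 lines is arbitrary:

    ```bash
    # Print the last N log lines (default 50) for each xpander deployment.
    tail_xpander_logs() {
      for d in xpander-agent-controller xpander-ai-gateway xpander-agent-worker xpander-mcp; do
        echo "=== $d ==="
        kubectl -n xpander logs "deployment/$d" --tail="${1:-50}" || echo "(no logs for $d)"
      done
    }

    # Usage: tail_xpander_logs 100
    ```

    If a container keeps restarting, `kubectl -n xpander logs <pod-name> --previous` shows the output of the previous, crashed instance.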
  </Accordion>
</AccordionGroup>

## PrivateLink Issues

<AccordionGroup>
  <Accordion title="PrivateLink InvalidServiceName (cross-region)">
    When creating a VPC endpoint to the xpander service from any region other than `us-west-2`, you **must** include `--service-region us-west-2`:

    ```bash theme={"dark"}
    aws ec2 create-vpc-endpoint \
      --service-name com.amazonaws.vpce.us-west-2.vpce-svc-0101884b32f655197 \
      --service-region us-west-2 \
      ...
    ```

    Without `--service-region`, AWS looks for the service in your local region and fails with `InvalidServiceName`.
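
    The rule can be expressed as a small check. This sketch assumes the service name follows the `com.amazonaws.vpce.<region>.vpce-svc-<id>` layout shown above; both helper names are hypothetical:

    ```bash
    # Extract the region (4th dot-separated field) from an endpoint service name.
    service_region_of() {
      echo "$1" | cut -d. -f4
    }

    # Succeeds when the service lives in a different region than the caller,
    # which is when --service-region must be passed.
    needs_service_region() {
      [ "$(service_region_of "$1")" != "$2" ]
    }

    # Usage:
    #   needs_service_region com.amazonaws.vpce.us-west-2.vpce-svc-0101884b32f655197 eu-west-1 \
    #     && echo "add --service-region us-west-2"
    ```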
  </Accordion>

  <Accordion title="PrivateLink connection timeout (HTTP 000)">
    Check that the security group on the VPC endpoint allows inbound TCP 443 from your VPC CIDR. If the rule is missing, add it:

    ```bash theme={"dark"}
    aws ec2 authorize-security-group-ingress \
      --group-id <ENDPOINT_SG_ID> \
      --protocol tcp --port 443 --cidr <VPC_CIDR> \
      --region <REGION> --profile <PROFILE>
    ```

    Also verify that the private DNS hosted zone and alias record were created correctly. See [Configure PrivateLink](/self-hosted/privatelink#4-create-private-dns).
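
    `000` is what `curl` reports for `%{http_code}` when no HTTP response arrives at all. The sketch below reproduces the probe from a host inside the VPC. The helper names and the `<ENDPOINT_URL>` placeholder are assumptions, not values from this guide:

    ```bash
    # Print only the HTTP status code for a URL; prints 000 if no response arrives.
    probe_http_code() {
      curl -s -o /dev/null -w '%{http_code}' --connect-timeout "${2:-5}" "$1"
    }

    # Turn the status code into a hint about where to look next.
    explain_http_code() {
      case "$1" in
        000) echo "no HTTP response: check DNS, routing, and the endpoint security group" ;;
        *)   echo "endpoint reachable (HTTP $1)" ;;
      esac
    }

    # Usage: explain_http_code "$(probe_http_code '<ENDPOINT_URL>')"
    ```

    Any code other than `000`, including `4xx`, means the network path works and the problem lies elsewhere.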
  </Accordion>
</AccordionGroup>

## Ingress and Networking Issues

<AccordionGroup>
  <Accordion title="Ingress not accessible">
    Verify ingress configuration:

    ```bash theme={"dark"}
    kubectl -n xpander describe ingress
    kubectl -n xpander get ingress
    ```

    Check that the NLB was provisioned and DNS CNAME records point to it:

    ```bash theme={"dark"}
    kubectl get svc -n ingress-nginx ingress-nginx-controller \
      -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
    ```
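
    To compare the CNAME target with the NLB hostname, a sketch along these lines may help. It assumes `dig` is installed; the helper names and the `<HOSTNAME>` placeholder are hypothetical. DNS answers usually end with a trailing dot and may differ in case, so both sides are normalized first:

    ```bash
    # Lowercase a hostname and drop one trailing dot.
    normalize_host() {
      printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
    }

    # Succeeds when two hostnames are equal after normalization.
    same_host() {
      [ "$(normalize_host "$1")" = "$(normalize_host "$2")" ]
    }

    # Usage:
    #   nlb=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
    #     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
    #   same_host "$(dig +short CNAME '<HOSTNAME>')" "$nlb" && echo "CNAME points at the NLB"
    ```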
  </Accordion>

  <Accordion title="Load Balancer not provisioning">
    If using EKS Auto Mode, ensure the cluster role trust policy includes `sts:TagSession`:

    ```json theme={"dark"}
    {
      "Effect": "Allow",
      "Principal": { "Service": "eks.amazonaws.com" },
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }
    ```
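
    To check an existing role, its trust policy can be fetched with `aws iam get-role` and searched for the action. This is a crude string match rather than a policy evaluation; `policy_allows_tag_session` is a hypothetical helper name and `<CLUSTER_ROLE_NAME>` is a placeholder:

    ```bash
    # Reads a trust policy JSON document on stdin.
    # Succeeds if the literal action "sts:TagSession" appears anywhere in it.
    policy_allows_tag_session() {
      grep -q '"sts:TagSession"'
    }

    # Usage:
    #   aws iam get-role --role-name '<CLUSTER_ROLE_NAME>' \
    #     --query 'Role.AssumeRolePolicyDocument' --output json |
    #     policy_allows_tag_session && echo "sts:TagSession present"
    ```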
  </Accordion>

  <Accordion title="SSL certificate errors on chat URLs">
    The ACM certificate must include `*.chat.<DOMAIN>` as a subject alternative name. The chat UI generates per-thread subdomains (e.g., `moccasin-prawn.chat.<DOMAIN>`) that are not covered by `*.<DOMAIN>`, because a wildcard matches only a single DNS label.

    Request a new certificate with:

    ```bash theme={"dark"}
    aws acm request-certificate \
      --domain-name "<DOMAIN>" \
      --subject-alternative-names "*.<DOMAIN>" "*.chat.<DOMAIN>" \
      --validation-method DNS \
      --region <REGION> --profile <PROFILE>
    ```
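
    The single-label rule behind this error can be illustrated with a small matcher. It is a sketch of the wildcard rule only, not of ACM's validation logic; `wildcard_covers` is a hypothetical helper and `example.com` stands in for `<DOMAIN>`:

    ```bash
    # Succeeds if a certificate name (exact or "*.suffix") covers the host.
    # A leading "*" stands for exactly one DNS label, never two or more.
    wildcard_covers() {
      pattern=$1
      host=$2
      case "$pattern" in
        \*.*)
          suffix=${pattern#\*}            # "*.example.com" -> ".example.com"
          case "$host" in
            *"$suffix")
              label=${host%"$suffix"}     # part matched by the "*"
              case "$label" in
                ""|*.*) return 1 ;;       # empty, or more than one label
                *) return 0 ;;
              esac ;;
            *) return 1 ;;
          esac ;;
        *) [ "$pattern" = "$host" ] ;;
      esac
    }

    # Usage:
    #   wildcard_covers '*.example.com' 'moccasin-prawn.chat.example.com' || echo "not covered"
    ```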
  </Accordion>
</AccordionGroup>

## API Key Issues

<AccordionGroup>
  <Accordion title="API key not being picked up after helm upgrade">
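    The `xpander-static` secret has a Helm resource keep policy, so a key changed through `helm upgrade` may not reach the running pods. Set the key on the secret directly, then restart the worker: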

    ```bash theme={"dark"}
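    # tr -d '\n' strips the line wraps that GNU base64 adds to long values,
    # which would otherwise produce invalid JSON in the patch.
    kubectl patch secret xpander-static -n xpander --type=merge \
      -p "{\"data\":{\"ANTHROPIC_API_KEY\":\"$(printf '%s' '<YOUR_KEY>' | base64 | tr -d '\n')\"}}"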

    kubectl rollout restart deployment xpander-agent-worker -n xpander
    ```

    See the [secret field name mapping](/self-hosted/deployment#managing-api-keys) for all key names.
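
    To check that the secret holds the value you expect, the sketch below encodes a value the same way and reads a key back. The helper names are hypothetical. It strips newlines because GNU `base64` wraps output at 76 characters, and it assumes a `base64` that accepts `--decode` (GNU coreutils and macOS both do):

    ```bash
    # Base64-encode a value on a single line, with no trailing newline in the input.
    encode_secret_value() {
      printf '%s' "$1" | base64 | tr -d '\n'
    }

    # Print the decoded value of one key in the xpander-static secret.
    read_secret_key() {
      kubectl -n xpander get secret xpander-static \
        -o "jsonpath={.data.$1}" | base64 --decode
    }

    # Usage: read_secret_key ANTHROPIC_API_KEY
    ```

    `read_secret_key` prints the key in plain text, so run it only in a private terminal.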
  </Accordion>

  <Accordion title="API key configuration — checking current values">
    ```bash theme={"dark"}
    kubectl -n xpander exec deployment/xpander-agent-worker -- env | grep API_KEY
    ```
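
    The command above prints each key in plain text. Where that is a concern, for example in a shared terminal or a recorded session, a filter like this sketch keeps only the first four characters of each value. `mask_env_values` is a hypothetical helper name:

    ```bash
    # Read NAME=value lines on stdin and mask all but the first 4 characters
    # of each value.
    mask_env_values() {
      sed -E 's/^([^=]+)=(.{0,4}).*$/\1=\2****/'
    }

    # Usage:
    #   kubectl -n xpander exec deployment/xpander-agent-worker -- env |
    #     grep API_KEY | mask_env_values
    ```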
  </Accordion>
</AccordionGroup>

## Debug Commands

```bash theme={"dark"}
# Get all resources
kubectl -n xpander get all

# Check events
kubectl -n xpander get events --sort-by=.metadata.creationTimestamp

# Describe problematic pods
kubectl -n xpander describe pod <pod-name>

# Check service endpoints
kubectl -n xpander get endpoints
```
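
When events are noisy, or when you need to hand the output to someone else, the commands above can be narrowed and captured. The following is a minimal sketch: both helper names and the default output directory are hypothetical.

```bash
# Show only Warning events in the namespace, oldest first.
warning_events() {
  kubectl -n "${1:-xpander}" get events \
    --field-selector type=Warning \
    --sort-by=.metadata.creationTimestamp
}

# Save resources, events, and endpoints to text files in a directory,
# then print the directory name.
collect_diagnostics() {
  out=${1:-xpander-diagnostics}
  mkdir -p "$out"
  kubectl -n xpander get all -o wide > "$out/resources.txt" 2>&1
  kubectl -n xpander get events --sort-by=.metadata.creationTimestamp > "$out/events.txt" 2>&1
  kubectl -n xpander get endpoints > "$out/endpoints.txt" 2>&1
  echo "$out"
}

# Usage: warning_events; collect_diagnostics ./xpander-diagnostics
```

Review the files before sharing them, since event messages and resource listings can contain hostnames and other environment details.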
