Install in Kubernetes
This guide provides step-by-step instructions for deploying the vLLM Semantic Router with Envoy AI Gateway on Kubernetes.
Architecture Overview
The deployment consists of:
- vLLM Semantic Router: Provides intelligent request routing and semantic understanding
- Envoy Gateway: Core gateway functionality and traffic management
- Envoy AI Gateway: AI Gateway built on Envoy Gateway for LLM providers
- Gateway API Inference Extension: CRDs for managing inference pools
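Roughly, a request travels through these pieces in the following order (a simplified sketch; the inference pool created later in this guide is what wires the gateway to the router):
Client request → Envoy AI Gateway (running on Envoy Gateway) → InferencePool → vLLM Semantic Router (picks a model) → selected LLM backend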
Prerequisites
Before starting, ensure you have the following tools installed:
- kind - Kubernetes in Docker (Optional)
- kubectl - Kubernetes CLI
- Helm - Package manager for Kubernetes
Step 1: Create Kind Cluster (Optional)
Create a local Kubernetes cluster optimized for the semantic router workload:
# Create cluster with optimized resource settings
kind create cluster --name semantic-router-cluster --config tools/kind/kind-config.yaml
# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
Note: The kind configuration provides sufficient resources (8GB+ RAM, 4+ CPU cores) for running the semantic router and AI gateway components.
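Kind runs its nodes as containers, so the memory and CPU actually available come from your Docker daemon rather than from the kind config file itself. A quick sanity check (assuming a local Docker installation) is:
# Verify the Docker daemon backing kind has enough resources (8GB+ RAM, 4+ CPU cores)
docker info | grep -E 'CPUs|Total Memory'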
Step 2: Deploy vLLM Semantic Router
Configure the semantic router by editing deploy/kubernetes/config.yaml. This file contains the vLLM configuration, including model config, endpoints, and policies. The repository provides two Kustomize overlays, similar to docker-compose profiles:
- core (default): only the semantic-router. Path: deploy/kubernetes/overlays/core (the root deploy/kubernetes/ points here by default).
- llm-katan: semantic-router plus an llm-katan sidecar listening on port 8002 and serving the model name qwen3. Path: deploy/kubernetes/overlays/llm-katan.
Important notes before you apply manifests:
- vllm_endpoints.address must be an IP address (not a hostname) reachable from inside the cluster. If your LLM backends run as Kubernetes Services, use the ClusterIP (for example 10.96.0.10) and set port accordingly. Do not include a protocol or path. A sample endpoint entry is sketched after this list.
- The PVC in deploy/kubernetes/pvc.yaml uses storageClassName: standard. On some clouds or local clusters the default StorageClass name may differ (e.g., standard-rwo, gp2, or a provisioner such as local-path). Adjust as needed.
- The default PVC size is 30Gi. Size it to at least 2–3x your total model footprint to leave room for indexes and updates.
- The initContainer downloads several models from Hugging Face on first run and writes them into the PVC. Ensure outbound egress to Hugging Face is allowed and that there is at least ~6–8 GiB of free space for the models specified.
- Per mode, the init container downloads differ:
  - core: classifiers plus the embedding model sentence-transformers/all-MiniLM-L12-v2 into /app/models/all-MiniLM-L12-v2.
  - llm-katan: everything in core, plus Qwen/Qwen3-0.6B into /app/models/Qwen/Qwen3-0.6B.
- The default config.yaml points to qwen3 at 127.0.0.1:8002, which matches the llm-katan overlay. If you use core (no sidecar), either change vllm_endpoints to your actual backend Service IP:Port, or deploy the llm-katan overlay.
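As an illustration only, a vllm_endpoints entry pointing at an in-cluster backend might look like the sketch below. The address and port fields follow the notes above; the entry name, ClusterIP, and port are placeholders, and the exact schema should be taken from the config.yaml shipped in deploy/kubernetes/:
# Sketch of a vllm_endpoints entry (placeholders; verify field names against the repository's config.yaml)
vllm_endpoints:
  - name: my-backend        # hypothetical entry name
    address: 10.96.0.10     # ClusterIP of your LLM backend Service (IP only, no scheme or path)
    port: 8000              # port exposed by that Service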
Deploy the semantic router service with all required components (core mode by default):
# Deploy semantic router (core mode)
kubectl apply -k deploy/kubernetes/
# Wait for deployment to be ready (this may take several minutes for model downloads)
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
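On first start the pod can sit in the Init phase while models download. You can follow the init container's logs to watch progress; the container name below is not taken from the manifests, so list the real names first and substitute accordingly:
# Print the initContainer names defined in the deployment
kubectl get deployment semantic-router -n vllm-semantic-router-system -o jsonpath='{.spec.template.spec.initContainers[*].name}'
# Follow the init container logs (replace model-downloader with a name printed above)
kubectl logs -n vllm-semantic-router-system deployment/semantic-router -c model-downloader -f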
To run with the llm-katan overlay instead:
kubectl apply -k deploy/kubernetes/overlays/llm-katan
Step 3: Install Envoy Gateway
Install the core Envoy Gateway for traffic management:
# Install Envoy Gateway using Helm
helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
--version v0.0.0-latest \
--namespace envoy-gateway-system \
--create-namespace
# Wait for Envoy Gateway to be ready
kubectl wait --timeout=300s -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
Step 4: Install Envoy AI Gateway
Install the AI-specific extensions for inference workloads:
# Install Envoy AI Gateway using Helm
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--create-namespace
# Wait for AI Gateway Controller to be ready
kubectl wait --timeout=300s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
Step 5: Install Gateway API Inference Extension
Install the Custom Resource Definitions (CRDs) for managing inference pools:
# Install Gateway API Inference Extension CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml
# Verify CRDs are installed
kubectl get crd | grep inference
Step 6: Configure AI Gateway
Apply the AI Gateway configuration to connect with the semantic router:
# Apply AI Gateway configuration
kubectl apply -f deploy/kubernetes/ai-gateway/configuration
# Restart controllers to pick up new configuration
kubectl rollout restart -n envoy-gateway-system deployment/envoy-gateway
kubectl rollout restart -n envoy-ai-gateway-system deployment/ai-gateway-controller
# Wait for controllers to be ready
kubectl wait --timeout=120s -n envoy-gateway-system deployment/envoy-gateway --for=condition=Available
kubectl wait --timeout=120s -n envoy-ai-gateway-system deployment/ai-gateway-controller --for=condition=Available
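If the configuration in this step defines the Gateway (as the parentRef in Step 8's expected output suggests), you can confirm it exists before moving on:
# The Gateway should be listed once the AI Gateway configuration has been applied
kubectl get gateway vllm-semantic-router -n vllm-semantic-router-system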
Step 7: Create Inference Pool
Create the inference pool that connects the gateway to the semantic router backend:
# Create inference pool configuration
kubectl apply -f deploy/kubernetes/ai-gateway/inference-pool
# Wait for inference pool to be ready
sleep 30
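The fixed sleep is a blunt instrument; as an alternative sketch, you can poll the pool's status until an Accepted condition shows up (a plain grep is used here because the condition sits under the pool's parent status rather than at the top level, which a simple kubectl wait may not match):
# Poll for up to ~2.5 minutes until the InferencePool reports an Accepted condition
for i in $(seq 1 30); do
  kubectl get inferencepool vllm-semantic-router -n vllm-semantic-router-system -o yaml \
    | grep -q 'reason: Accepted' && break
  sleep 5
done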
Step 8: Verify Deployment
Verify that the inference pool has been created and is properly configured:
# Check inference pool status
kubectl get inferencepool vllm-semantic-router -n vllm-semantic-router-system -o yaml
Expected output should show the inference pool in the Accepted state:
status:
  parent:
  - conditions:
    - lastTransitionTime: "2025-09-27T09:27:32Z"
      message: "InferencePool has been Accepted by controller ai-gateway-controller: InferencePool reconciled successfully"
      observedGeneration: 1
      reason: Accepted
      status: "True"
      type: Accepted
    - lastTransitionTime: "2025-09-27T09:27:32Z"
      message: "Reference resolution by controller ai-gateway-controller: All references resolved successfully"
      observedGeneration: 1
      reason: ResolvedRefs
      status: "True"
      type: ResolvedRefs
    parentRef:
      group: gateway.networking.k8s.io
      kind: Gateway
      name: vllm-semantic-router
      namespace: vllm-semantic-router-system
Testing the Deployment
Method 1: Port Forwarding (Recommended for Local Testing)
Set up port forwarding to access the gateway locally:
# Set up environment variables
export GATEWAY_IP="localhost:8080"
# Get the Envoy service name
export ENVOY_SERVICE=$(kubectl get svc -n envoy-gateway-system \
--selector=gateway.envoyproxy.io/owning-gateway-namespace=vllm-semantic-router-system,gateway.envoyproxy.io/owning-gateway-name=vllm-semantic-router \
-o jsonpath='{.items[0].metadata.name}')
# Start port forwarding (run in background or separate terminal)
kubectl port-forward -n envoy-gateway-system svc/$ENVOY_SERVICE 8080:80
Method 2: External IP (For Production Deployments)
For production deployments with external load balancers:
# Get the Gateway external IP
GATEWAY_IP=$(kubectl get gateway vllm-semantic-router -n vllm-semantic-router-system -o jsonpath='{.status.addresses[0].value}')
echo "Gateway IP: $GATEWAY_IP"
Send Test Requests
Once the gateway is accessible, test the inference endpoint:
# Test the chat completions endpoint with a math-domain request
curl -i -X POST http://$GATEWAY_IP/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "auto",
"messages": [
{"role": "user", "content": "What is the derivative of f(x) = x^3 + 2x^2 - 5x + 7?"}
]
}'
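With "model": "auto", the router picks the backend model for you. Assuming the gateway returns an OpenAI-compatible JSON body and jq is installed, you can pull out just the model that served the request:
# Show which model handled the routed request
curl -s -X POST http://$GATEWAY_IP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "auto", "messages": [{"role": "user", "content": "What is 2 + 2?"}]}' \
  | jq -r '.model'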
Troubleshooting
Common Issues
Gateway not accessible:
# Check gateway status
kubectl get gateway vllm-semantic-router -n vllm-semantic-router-system
# Check Envoy service
kubectl get svc -n envoy-gateway-system
Inference pool not ready:
# Check inference pool events
kubectl describe inferencepool vllm-semantic-router -n vllm-semantic-router-system
# Check AI gateway controller logs
kubectl logs -n envoy-ai-gateway-system deployment/ai-gateway-controller
Semantic router not responding:
# Check semantic router pod status
kubectl get pods -n vllm-semantic-router-system
# Check semantic router logs
kubectl logs -n vllm-semantic-router-system deployment/semantic-router
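If the pod looks healthy but requests still fail, recent events in the namespace (image pulls, PVC binding, probe failures) often point at the cause:
# Show recent events in the semantic router namespace, newest last
kubectl get events -n vllm-semantic-router-system --sort-by=.lastTimestamp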
Cleanup
To remove the entire deployment:
# Remove inference pool
kubectl delete -f deploy/kubernetes/ai-gateway/inference-pool
# Remove AI gateway configuration
kubectl delete -f deploy/kubernetes/ai-gateway/configuration
# Remove semantic router
kubectl delete -k deploy/kubernetes/
# Remove AI gateway
helm uninstall aieg -n envoy-ai-gateway-system
# Remove Envoy gateway
helm uninstall eg -n envoy-gateway-system
# Remove Gateway API CRDs (optional)
kubectl delete -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml
# Delete kind cluster
kind delete cluster --name semantic-router-cluster
Next Steps
- Configure custom routing rules in the AI Gateway
- Set up monitoring and observability
- Implement authentication and authorization
- Scale the semantic router deployment for production workloads