LiteLLM on Google Kubernetes Engine (GKE Autopilot)

This document is a reference for the modules/LiteLLM_GKE Terraform module, which deploys LiteLLM on GKE Autopilot.


1. Module Overview

LiteLLM GKE deploys LiteLLM — the open-source LLM proxy and AI gateway — on GKE Autopilot with Kubernetes-native scaling, Workload Identity IAM, Cloud SQL Auth Proxy for PostgreSQL, and the full Foundation Module (App GKE) infrastructure stack.

Key differences from LiteLLM CloudRun:

  • Runs as a Kubernetes Deployment on GKE Autopilot instead of Cloud Run.
  • Uses the GCS Fuse CSI driver for storage mounts.
  • Uses the Horizontal Pod Autoscaler (HPA) for scaling instead of Cloud Run's built-in scaling.
  • Uses Workload Identity for GCP API access.
  • credit_cost defaults to 150 (vs 50 for Cloud Run).

GCP Services deployed:

  • GKE Autopilot cluster (via Services GCP)
  • Kubernetes Deployments, Services, Jobs
  • HPA
  • Artifact Registry
  • Cloud Storage
  • Cloud SQL PostgreSQL 15
  • Cloud SQL Auth Proxy
  • Workload Identity
  • Secret Manager
  • Cloud Monitoring + Uptime Checks
  • Redis (optional)

2. Core Service Configuration

A. Compute (GKE)

| Variable | Group | Default | Description |
|---|---|---|---|
| deploy_application | 4 | true | Set false for infrastructure-only deployment. |
| cpu_limit | 4 | '2000m' | CPU per pod. |
| memory_limit | 4 | '2Gi' | Memory per pod. |
| min_instance_count | 4 | 1 | Minimum pod replicas. |
| max_instance_count | 4 | 3 | Maximum pod replicas (HPA ceiling). |
| container_port | 4 | 4000 | LiteLLM's native port. |
| enable_cloudsql_volume | 4 | true | Injects Cloud SQL Auth Proxy sidecar into pods. |
| timeout_seconds | 4 | 600 | Request timeout. |
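
As a minimal sketch, the compute settings above could be passed to the module as follows. The module label litellm_gke and the relative source path are assumptions, the variable types are inferred from the documented defaults, and any other inputs the module requires are omitted.

```hcl
# Hypothetical root-module call; the label and source path are assumptions.
module "litellm_gke" {
  source = "./modules/LiteLLM_GKE"

  # Documented defaults, except max_instance_count.
  deploy_application     = true
  cpu_limit              = "2000m"
  memory_limit           = "2Gi"
  min_instance_count     = 1
  max_instance_count     = 5 # raised from the documented default of 3
  container_port         = 4000
  enable_cloudsql_volume = true
  timeout_seconds        = 600

  # Other inputs the module may require are not shown here.
}
```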

B. Database (Cloud SQL — PostgreSQL 15)

Same as LiteLLM CloudRun. See LiteLLM CloudRun §3.B.

| Variable | Group | Default | Description |
|---|---|---|---|
| database_type | 12 | 'POSTGRES_15' | Required. LiteLLM uses PostgreSQL for Prisma ORM. |
| db_name | 12 | 'litellm_db' | PostgreSQL database name. |
| db_user | 12 | 'litellm_user' | PostgreSQL application user. |
| database_password_length | 12 | 32 | Auto-generated password length. |
| enable_auto_password_rotation | 12 | false | Automated password rotation. |

C. Application Settings

| Variable | Group | Default | Description |
|---|---|---|---|
| environment_variables | 6 | { LITELLM_LOG="INFO", NUM_WORKERS="1" } | Plain-text env vars. |
| secret_environment_variables | 6 | {} | Secret Manager references for LLM provider API keys. |
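
The following sketch shows how the two maps might be set. The shape of secret_environment_variables (environment variable name mapped to a Secret Manager secret name) is an assumption, as are the secret name, the module label, and the source path.

```hcl
# Hypothetical module call; only the application settings are shown.
module "litellm_gke" {
  source = "./modules/LiteLLM_GKE" # assumed path

  # Plain-text environment variables (values are strings).
  environment_variables = {
    LITELLM_LOG = "INFO"
    NUM_WORKERS = "2"
  }

  # Assumed shape: env var name => Secret Manager secret name.
  secret_environment_variables = {
    OPENAI_API_KEY = "openai-api-key" # hypothetical secret name
  }
}
```

Keeping provider API keys in secret_environment_variables rather than environment_variables keeps them out of plain-text Terraform configuration.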

D. Storage & Networking

Same variables as LiteLLM CloudRun. See LiteLLM CloudRun §3.E and §3.F.

E. Initialization

Same as LiteLLM CloudRun. The default db-init job from LiteLLM Common creates the PostgreSQL database and user.


3. GKE-Specific Features

A. StatefulSet and Persistent Volumes

| Variable | Group | Default | Description |
|---|---|---|---|
| stateful_pvc_enabled | | false | Enables PVC for the pod. Automatically uses StatefulSet. |
| workload_type | | 'Deployment' | 'Deployment' or 'StatefulSet'. |
| quota_memory_requests | | | ResourceQuota memory requests. Must use binary suffixes ('4Gi'). |
| quota_memory_limits | | | ResourceQuota memory limits. Must use binary suffixes. |
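
A minimal sketch of enabling a PVC together with memory quotas is shown below. The quota values, module label, and source path are illustrative assumptions. workload_type is deliberately left unset, on the assumption that the module selects StatefulSet on its own when stateful_pvc_enabled is true.

```hcl
# Hypothetical module call; only the StatefulSet-related inputs are shown.
module "litellm_gke" {
  source = "./modules/LiteLLM_GKE" # assumed path

  # Enabling the PVC is documented to switch the workload to a StatefulSet.
  stateful_pvc_enabled = true

  # Quantities are strings with binary suffixes (Gi, Mi), never bare integers.
  quota_memory_requests = "4Gi" # illustrative value
  quota_memory_limits   = "8Gi" # illustrative value
}
```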

B. Horizontal Pod Autoscaler

The HPA scales the pod count between min_instance_count and max_instance_count based on CPU and memory utilization. LiteLLM's stateless request routing scales well horizontally.
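
For orientation, an HPA of this kind can be expressed with the Terraform Kubernetes provider roughly as follows. This is a sketch and not the module's actual definition: the resource names, namespace, and the 70% and 80% utilization targets are assumptions.

```hcl
# Illustrative only; the module's real HPA definition is not shown in this document.
resource "kubernetes_horizontal_pod_autoscaler_v2" "litellm" {
  metadata {
    name      = "litellm" # hypothetical
    namespace = "litellm" # hypothetical
  }

  spec {
    min_replicas = 1 # corresponds to min_instance_count
    max_replicas = 3 # corresponds to max_instance_count

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "litellm" # hypothetical Deployment name
    }

    # Scale on average CPU utilization across pods.
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 70 # assumed threshold
        }
      }
    }

    # Scale on average memory utilization across pods.
    metric {
      type = "Resource"
      resource {
        name = "memory"
        target {
          type                = "Utilization"
          average_utilization = 80 # assumed threshold
        }
      }
    }
  }
}
```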


4. Advanced Security

Identical to LiteLLM CloudRun for Cloud Armor, Binary Authorization, and VPC Service Controls.

Workload Identity: The Kubernetes service account is annotated with the GCP service account email. The Cloud SQL client role and, if needed, roles/datastore.user are granted to the GCP service account, which pods assume through Workload Identity.
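
The binding described above generally takes the following form in Terraform. This is a sketch under stated assumptions rather than the module's own code: the account ID, the namespace and service account name (both "litellm"), and the project_id variable are hypothetical.

```hcl
variable "project_id" {
  type        = string
  description = "GCP project that hosts the cluster (hypothetical input)."
}

# GCP service account that the pods will act as.
resource "google_service_account" "litellm" {
  project    = var.project_id
  account_id = "litellm-gke" # hypothetical
}

# Kubernetes service account annotated with the GCP service account email.
resource "kubernetes_service_account" "litellm" {
  metadata {
    name      = "litellm" # hypothetical
    namespace = "litellm" # hypothetical
    annotations = {
      "iam.gke.io/gcp-service-account" = google_service_account.litellm.email
    }
  }
}

# Allow the Kubernetes service account to impersonate the GCP service account.
resource "google_service_account_iam_member" "workload_identity" {
  service_account_id = google_service_account.litellm.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.project_id}.svc.id.goog[litellm/litellm]"
}

# Grant the Cloud SQL client role to the GCP service account.
resource "google_project_iam_member" "cloudsql_client" {
  project = var.project_id
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.litellm.email}"
}
```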


5. Redis Caching

Same as LiteLLM CloudRun. Redis response caching is optional but recommended for high-throughput deployments.

| Variable | Group | Default | Description |
|---|---|---|---|
| enable_redis | 21 | false | Enables Redis response caching. |
| redis_host | 21 | "" | Redis hostname or IP. |
| redis_port | 21 | '6379' | Redis TCP port. |
| redis_auth | 21 | "" | Redis AUTH password. Sensitive. |
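
A minimal sketch of turning on Redis is shown below. The host address, the redis_auth root variable, the module label, and the source path are assumptions. redis_port is written as a string because the documented default is quoted.

```hcl
# Hypothetical root variable so the password never appears in configuration files.
variable "redis_auth" {
  type      = string
  sensitive = true
}

module "litellm_gke" {
  source = "./modules/LiteLLM_GKE" # assumed path

  enable_redis = true
  redis_host   = "10.0.0.5" # hypothetical private IP; must not be empty when Redis is enabled
  redis_port   = "6379"
  redis_auth   = var.redis_auth
}
```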

6. Observability

| Variable | Group | Default | Description |
|---|---|---|---|
| startup_probe | 14 | { path="/health/readiness", initial_delay_seconds=60, failure_threshold=6 } | Pod startup probe. |
| liveness_probe | 14 | { path="/health/liveliness", initial_delay_seconds=30, failure_threshold=3 } | Pod liveness probe. |
| uptime_check_config | 14 | { enabled=true, path="/health/liveliness" } | Cloud Monitoring uptime check. |
| alert_policies | 14 | [] | Cloud Monitoring metric alert policies. |
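
The probe objects might be overridden as in the sketch below, for example to give a slow first start more headroom. It assumes the objects accept exactly the attributes shown in the defaults; the module may define additional fields. The module label and source path are also assumptions.

```hcl
# Hypothetical module call; only the observability inputs are shown.
module "litellm_gke" {
  source = "./modules/LiteLLM_GKE" # assumed path

  # Longer startup window than the documented default (60 s delay, 6 failures).
  startup_probe = {
    path                  = "/health/readiness"
    initial_delay_seconds = 90
    failure_threshold     = 10
  }

  # Documented default values, restated for clarity.
  liveness_probe = {
    path                  = "/health/liveliness"
    initial_delay_seconds = 30
    failure_threshold     = 3
  }

  uptime_check_config = {
    enabled = true
    path    = "/health/liveliness"
  }
}
```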

7. Platform-Managed Behaviours

| Behaviour | Detail |
|---|---|
| PostgreSQL 15 required | database_type = "POSTGRES_15" fixed by LiteLLM Common. |
| Custom Docker image | image_source = "custom": Cloud Build creates a custom image with the LiteLLM entrypoint script. |
| LITELLM_MASTER_KEY / LITELLM_SALT_KEY | Auto-generated by LiteLLM Common, stored in Secret Manager. |
| Default db-init job | Injected by LiteLLM Common when initialization_jobs = []. |
| Workload Identity | GKE SA bound via Workload Identity for GCP API access. |
| Cloud SQL Auth Proxy | Injected as a sidecar container into each pod. |

8. Outputs

| Output | Description |
|---|---|
| kubernetes_ready | True when GKE cluster and Kubernetes resources are deployed. |
| deployment_id | Deployment ID suffix used in resource names. |
| database_instance_name | Cloud SQL PostgreSQL instance name. |
| database_name | LiteLLM database name. |
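
These outputs can be re-exported from a root module, as in the sketch below. The module label, source path, and root output names are assumptions; only the module output names come from the table above.

```hcl
module "litellm_gke" {
  source = "./modules/LiteLLM_GKE" # assumed path
  # Inputs omitted.
}

# Re-export selected module outputs under hypothetical root output names.
output "litellm_ready" {
  value = module.litellm_gke.kubernetes_ready
}

output "litellm_database_instance" {
  value = module.litellm_gke.database_instance_name
}

output "litellm_database_name" {
  value = module.litellm_gke.database_name
}
```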

Configuration Pitfalls & Sensible Defaults

Risk levels: Critical (data loss, full outage, security breach) — High (service unavailable or significant degradation) — Medium (degraded function or increased cost) — Low (minor impact).

| Variable | Sensible Default | Risk | Consequence of Incorrect Value |
|---|---|---|---|
| LITELLM_MASTER_KEY (auto-generated) | "sk-<random>" in Secret Manager | Critical | Controls all administrative operations and authenticates proxy API calls. Rotation immediately breaks all integrations holding existing virtual keys or the master key. Treat as immutable unless performing a coordinated key rotation with all consumers. |
| LITELLM_SALT_KEY (auto-generated) | Random secret in Secret Manager | Critical | Salts all virtual API keys in the database. Changing it makes every previously issued virtual key permanently invalid. All API consumers must be issued new keys. Treat as permanently immutable. |
| STORE_MODEL_IN_DB (via environment_variables) | "True" | High | Required for database-backed model and key management via the Admin UI. Setting to "False" disables the Admin UI model management and reverts to YAML-file-only configuration. |
| enable_cloudsql_volume | true | Critical | The Cloud SQL Auth Proxy sidecar is required for PostgreSQL connectivity in the GKE pod. Disabling it causes Prisma to fail to connect to the database at startup. |
| database_type | "POSTGRES_15" | Critical | LiteLLM requires PostgreSQL for virtual key management and spend tracking. Without it, the STORE_MODEL_IN_DB features are unavailable and key management is disabled. |
| enable_redis | false | High | Without Redis, rate-limit counters are per-pod and not shared across replicas. For accurate rate limiting and response caching in a multi-replica GKE deployment, Redis is essential. |
| redis_host | "" | High | Must be set when enable_redis = true. An empty redis_host with Redis enabled causes LiteLLM to log cache connection errors on every request. |
| quota_memory_requests / quota_memory_limits | Binary unit defaults | Critical | Must include binary unit suffixes (Gi, Mi). Bare integer values are treated as bytes by Kubernetes and block all pod scheduling. |
| stateful_pvc_enabled | false | Medium | GCS Fuse is the default persistence backend for any config files. Enabling PVC storage for LiteLLM config prevents pod migration across nodes and complicates rolling updates. |
| workload_type | null (auto-select) | Medium | Setting stateful_pvc_enabled = true alongside workload_type = "Deployment" fails at plan time. Let auto-selection handle this. |
| min_instance_count | 1 | High | LiteLLM is a shared API gateway. Cold starts (30–60 s while GKE Autopilot provisions a node) cause queuing in all dependent services. Keep at least 1 replica running at all times. |
| timeout_seconds | 600 | High | Large language model inference can take several minutes. Proxy requests are terminated by the load balancer if the backend pod takes longer than timeout_seconds to respond. |
| service_type (Kubernetes) | "ClusterIP" | Critical | Exposing with LoadBalancer makes the LiteLLM master key and all virtual keys accessible over the public internet. Always use ClusterIP with an authenticated ingress or Gateway in front. |
| iap_oauth_client_id / iap_oauth_client_secret | "" | Critical | Required when enable_iap = true. Missing values prevent the IAP gateway from initialising and make the service unreachable. |
| NUM_WORKERS (via environment_variables) | "1" | Medium | A single worker serialises all requests. Increase to 2–4 for high-throughput deployments and scale cpu_limit proportionally. |
| backup_schedule | "" (disabled) | High | The PostgreSQL database holds all virtual keys and spend data. Without automated backups, accidental deletion or corruption causes permanent loss of key assignments and usage history. |
| enable_vertical_pod_autoscaling | false | Medium | Enabling VPA disables HPA because the two conflict. On GKE Autopilot, VPA is the recommended approach for right-sizing pods. Choose one or the other. |
| application_version | "main-stable" | Medium | LiteLLM releases frequently and may change the Prisma schema or break virtual key formats. Pin to a specific release for production stability. |
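
Two of these pitfalls can be caught before apply with guards in a wrapping root module. The sketch below is not part of the module; the root variables mirror the module's input names by assumption, and the check block requires Terraform 1.5 or later and reports a warning rather than failing the run.

```hcl
# Hypothetical root variable that is passed through to the module.
variable "quota_memory_requests" {
  type    = string
  default = "4Gi"

  # Reject bare integers, which Kubernetes would interpret as bytes.
  validation {
    condition     = can(regex("^[0-9]+(Ki|Mi|Gi|Ti)$", var.quota_memory_requests))
    error_message = "Use a binary suffix such as 4Gi or 512Mi."
  }
}

variable "enable_redis" {
  type    = bool
  default = false
}

variable "redis_host" {
  type    = string
  default = ""
}

# Warn when Redis is enabled without a host.
check "redis_host_is_set" {
  assert {
    condition     = !var.enable_redis || var.redis_host != ""
    error_message = "redis_host must be set when enable_redis is true."
  }
}
```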