Skip to main content

App GKE on Google Cloud Platform

This document provides a comprehensive analysis of the modules/App_GKE Terraform module on Google Cloud Platform. It details the architecture, IAM configuration, service integrations, and potential enhancements.


1. Module Overview

The modules/App_GKE module is a foundational building block for deploying containerized applications on Google Kubernetes Engine (GKE) Autopilot. It is designed to be highly configurable and orchestrates not just the compute layer, but also the surrounding ecosystem of networking, storage, databases, and observability.

Key Capabilities:

  • Compute: Deploys Kubernetes Deployments, StatefulSets, and CronJobs on GKE Autopilot with built-in auto-scaling (HPA/VPA).
  • Data Persistence: Integrates with Cloud SQL (via Auth Proxy sidecar), NFS, GCS (including GCS Fuse CSI Driver), and Persistent Volume Claims for stateful workloads.
  • Recurring Tasks: Supports scheduled operations via Kubernetes CronJobs.
  • Lifecycle Management: Supports initialization jobs (DB migrations, backups, setup) as Kubernetes Jobs.
  • CI/CD: Built-in support for Cloud Build triggers and image mirroring.
  • Service Mesh: When the module provisions its own GKE cluster (configure_service_mesh = true), it registers the cluster with GKE Fleet Hub and deploys Cloud Service Mesh with MANAGEMENT_AUTOMATIC. On externally-provisioned clusters, Istio sidecar injection is enabled via namespace labels only.

Required versions (versions.tf):

Tool / ProviderConstraintWhy
Terraform / OpenTofu>= 1.5.0optional() in object variables, terraform_data, check{}, sensitive() semantics
hashicorp/google>= 6.0.0Cloud Deploy v2, GKE Gateway API, Binary Authorization, Workload Identity schema
hashicorp/google-beta~> 6.0Matches google for beta-only Cloud Service Mesh / Fleet resources
hashicorp/kubernetes>= 2.25.0kubernetes_manifest with server-side CRD validation
hashicorp/random>= 3.0.0Random ID generation for resource naming
hashicorp/external>= 2.0.0Discovery scripts and CSI driver CRD probe
hashicorp/null>= 3.0.0IAM propagation waits and addon enablement triggers
integrations/github>= 5.0.0Cloud Build v2 GitHub connection

2. IAM & Access Control

The module implements a least-privilege IAM strategy using dedicated Service Accounts and Workload Identity. IAM bindings are managed directly in prerequisites.tf. A time_sleep of 120 seconds (cloudbuild_iam_propagation) gates Cloud Build builds on IAM propagation to prevent storage.objects.get denied errors before the Storage Admin binding has reached GCP's storage backend. The GKE workload SA (gke-sa-*) and Cloud Build SA (gke-build-sa-*) are created unconditionally (count = 1) on every deployment, regardless of whether a Services GCP network is present. The NFS SA and inline GKE node SA are conditional on whether those inline resources are needed.

Service Accounts

  1. GKE Workload Service Account (gke_sa, format: gke-sa-<random_id>):

    • Identity: The GCP identity bound to the Kubernetes Service Account via Workload Identity.
    • Project-level roles: roles/compute.networkUser, roles/secretmanager.secretAccessor, roles/storage.objectUser, roles/storage.objectAdmin, roles/cloudsql.client, roles/vpcaccess.user, roles/container.developer, roles/logging.logWriter, roles/monitoring.metricWriter.
  2. Cloud Build Service Account (cloud_build_sa, format: gke-build-sa-<random_id>):

    • Identity: Used by Cloud Build triggers for CI/CD.
    • Project-level roles: roles/secretmanager.secretAccessor, roles/cloudbuild.builds.editor, roles/viewer, roles/storage.objectAdmin, roles/artifactregistry.reader, roles/artifactregistry.writer, roles/container.admin, roles/iam.serviceAccountUser, roles/clouddeploy.operator, roles/logging.logWriter, roles/run.admin, roles/iam.serviceAccountTokenCreator.
    • roles/secretmanager.secretAccessor on the GitHub token secret (granted to three principals: the user-managed Cloud Build SA, the default Cloud Build service account, and the Cloud Build service agent service-<project_number>@gcp-sa-cloudbuild.iam.gserviceaccount.com).
  3. Inline NFS Service Account (inline_nfs_sa, format: app-nfs-sa-<random_id>):

    • Created when no existing NFS server is found (prereq_needs_nfs = true).
    • Roles: roles/storage.objectAdmin, roles/logging.logWriter, roles/compute.instanceAdmin.v1.
  4. Inline GKE Node Service Account (inline_gke_sa, format: app-gke-sa-<random_id>):

    • Created when no existing GKE cluster is found (prereq_needs_gke = true).
    • Roles: roles/logging.logWriter, roles/monitoring.metricWriter, roles/monitoring.viewer, roles/stackdriver.resourceMetadata.writer, roles/artifactregistry.reader, roles/storage.objectViewer.

Workload Identity

  • The module creates a Kubernetes Service Account annotated with iam.gke.io/gcp-service-account pointing to the GCP SA.
  • An IAM binding grants the K8s SA the roles/iam.workloadIdentityUser role on the GCP SA, scoped as serviceAccount:<project>.svc.id.goog[<namespace>/<ksa-name>].
  • This ensures pods authenticate to GCP APIs using projected service account tokens — no key files are created or stored.

3. Core Service Configuration

A. Compute (GKE Autopilot)

  • Workload Types: The workload_type variable accepts Deployment (stateless) or StatefulSet (stateful apps). CronJob is not a workload type — scheduled tasks are defined separately via the cron_jobs variable (see §3.E).
  • Resource Management: Configurable CPU and Memory requests/limits. Autopilot enforces minimum resource requirements.
  • Scaling: Support for Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA).
  • Replicas: Configurable min_instance_count and max_instance_count for scaling bounds.
  • Service Types: Configurable Kubernetes Service type (ClusterIP, LoadBalancer, NodePort).
  • Stateful Storage: Automatic PVC provisioning for StatefulSet workloads via volume claim templates.
  • Rollout timeouts: deployment_timeout (default 1800 seconds) controls how long Terraform waits for the Deployment/StatefulSet rollout during apply. Applied to create, update, and delete timeout blocks — raise for large images or slow-starting pods.
  • Null-default override pattern: Variables such as workload_type, pdb_min_available, stateful_pvc_*, and container_resources default to null rather than sentinel values. The module's modules.tf uses != null checks so wrapper modules can explicitly pass the module-preset value without it being mistaken for "unset". Note that min_instance_count (default 1) and max_instance_count (default 3) do not default to null.

B. Database (Cloud SQL)

  • Discovery & Provisioning: The module first runs a discovery script to locate an existing Cloud SQL instance in the project. If one is found it is used directly. If none is found, an inline instance is provisioned automatically — PostgreSQL via google_sql_database_instance.inline_postgres or MySQL via google_sql_database_instance.inline_mysql — along with a root password stored in Secret Manager.
  • Connectivity:
    • Cloud SQL Auth Proxy: Runs as a sidecar container in each pod, exposing the database on 127.0.0.1. The Unix socket mount path inside the container is configurable via cloudsql_volume_mount_path (default /cloudsql).
    • Direct TCP/IP: Falls back to injecting the DB_HOST IP address if the proxy is disabled (enable_cloudsql_volume = false).
  • Credentials: Automatically retrieves or generates DB passwords, stores them in Secret Manager, and syncs to Kubernetes Secrets.

C. Storage (NFS / GCS / GCS Fuse)

  1. NFS (Network File System):
    • Discovery: Automatically detects an existing NFS server in the region.
    • Mounting: Mounts the NFS share as a native Kubernetes NFS volume in pods.
  2. GCS (Object Storage):
    • Standard: Creates buckets with configurable lifecycles and versioning.
    • GCS Fuse CSI Driver: Mounts GCS buckets as file systems using the GKE-native CSI driver, with pod annotation gke-gcsfuse/volumes=true.

D. Networking & Network Policies

  • VPC Native: Pods run directly in the VPC network. No additional connector needed.
  • Cluster: Connects to a GKE Autopilot cluster using the cluster's dedicated subnet and secondary IP ranges. The cluster is resolved in the following order:
    1. An explicit cluster name supplied via gke_cluster_name (when gke_cluster_selection_mode = "explicit").
    2. A Services GCP-provisioned cluster discovered on the project's managed network.
    3. An inline cluster created by this module when no Services GCP dependency is present. In this mode the module also creates its own VPC, GKE cluster, NFS server, and Cloud SQL instance, making it fully self-contained (see §9).
  • Cluster Selection Modes (gke_cluster_selection_mode):
    • primary (default) — targets the primary GKE cluster discovered on the project's managed network.
    • explicit — targets the cluster named in gke_cluster_name directly; use when multiple clusters share the same network.
    • round-robin — distributes deployments across all clusters discovered on the network.
  • Micro-segmentation (NetworkPolicies):
    • Capability: When enable_network_segmentation is true, the module implements Kubernetes NetworkPolicy resources using GKE Dataplane V2 (Cilium).
    • Namespace Isolation: Restricts pod-to-pod traffic to only allow communication between pods within the same namespace.
    • External Access: Automatically permits ingress from Google Cloud L7 Load Balancers (Gateway API) and GKE Control Plane health checks.

E. Initialization Jobs & CronJobs

Initialization Jobs (initialization_jobs variable) are Kubernetes Jobs that run before the application starts:

  • nfs-setup: Prepares directory structures and permissions on the NFS server.
  • db-init: Runs custom scripts for database schema migration and seeding.
  • Cloud SQL Auth Proxy sidecar: Jobs with needs_db = true get a proxy sidecar with --quitquitquit for graceful termination.
  • Script ConfigMaps: Job scripts are mounted via Kubernetes ConfigMaps from the module's scripts/ directory.
  • Execution control: execute_on_apply maps to the Kubernetes provider's wait_for_completion flag. When true (default), Terraform blocks until the job completes before proceeding. When false, Terraform submits the job and continues without waiting.
  • Job ordering: depends_on_jobs is declared in the variable type and preserved in the merged job configuration, but job ordering is currently enforced implicitly through the Terraform resource dependency graph (e.g. nfs-setup completes before initialization jobs run) rather than via per-job depends_on_jobs chains.

CronJobs (cron_jobs variable) are recurring scheduled tasks independent of the main workload:

  • Each CronJob defines its own schedule (cron expression), image, resource limits, concurrency_policy, and history limits.
  • CronJobs can mount NFS volumes (mount_nfs) and GCS Fuse volumes (mount_gcs_volumes).
  • CronJobs are distinct from the workload_type variable — they always run as kubernetes_cron_job_v1 resources alongside the main Deployment or StatefulSet.

F. Additional Services

The additional_services variable defines independent helper or proxy services deployed alongside the main workload in the same namespace. Unlike sidecar containers, each entry creates its own kubernetes_deployment_v1, kubernetes_service_v1, and optional kubernetes_horizontal_pod_autoscaler_v2 — they run as separate pods.

  • Service URL injection: The main container (and StatefulSet) automatically receives a <SERVICE_NAME_UPPER>_URL environment variable (or a custom name via output_env_var_name) resolving to http://<name>:<port> using the ClusterIP Service.
  • Independent scaling: Each additional service has its own min_instance_count / max_instance_count. An HPA is created when max_instance_count > 1 and VPA is not enabled.
  • Health probes: Optional startup_probe and liveness_probe objects provide independent readiness and liveness checking.
  • Volume mounts: Entries in volume_mounts reference globally-defined volumes (NFS, GCS Fuse) by name.
  • Key variable: additional_services

Typical uses: Envoy/Nginx proxy, Redis Sentinel, background workers, or any tightly-coupled process that must co-locate with the main application but requires its own lifecycle and scaling.


4. Advanced Security

A. Cloud Armor WAF

When enable_cloud_armor = true, the module creates an inline google_compute_security_policy and attaches it to the GKE Gateway backend via GCPBackendPolicy.

Rules included in the inline policy:

PriorityActionRule
100AllowAdmin IP ranges bypass (admin_ip_ranges) — only added when admin_ip_ranges is non-empty
1000Deny (403)SQLi — sqli-v33-stable
1001Deny (403)XSS — xss-v33-stable
1002Deny (403)LFI — lfi-v33-stable
1003Deny (403)RCE — rce-v33-stable
2000Deny (429) + ban500 req/min per IP; HTTP 429 on exceed; 300-second (5-minute) ban duration
2147483647AllowDefault allow
  • Adaptive Protection: L7 DDoS detection is enabled automatically on all inline policies.
  • External policy: When enable_cloud_armor = false and cloud_armor_policy_name is set to a non-default value, the named externally-managed policy is attached instead.
  • Key variables: enable_cloud_armor, admin_ip_ranges, cloud_armor_policy_name

B. Identity-Aware Proxy (IAP)

When enable_iap = true, the application is protected by Google Cloud IAP via the GCPBackendPolicy on the Gateway:

  • OAuth: Requires a pre-created OAuth 2.0 client. Client ID and secret are supplied via iap_oauth_client_id / iap_oauth_client_secret (marked sensitive).
  • Access control: iap_authorized_users and iap_authorized_groups grant access to specific users and Google Groups respectively (e.g. user:alice@example.com, group:devs@example.com).
  • Support email: iap_support_email is displayed on the IAP consent screen.
  • Key variables: enable_iap, iap_authorized_users, iap_authorized_groups, iap_oauth_client_id, iap_oauth_client_secret, iap_support_email

C. Binary Authorization

When enable_binary_authorization = true, the module enforces image provenance before any Kubernetes workload is created.

  • Image signing: The module signs both the application image (container_image) and the DB clients image via module.app_security (backed by App_Common/modules/app_security) before Terraform creates any Kubernetes Deployment, StatefulSet, or Cloud Deploy release. Two stub null_resource resources (sign_container_image, sign_db_clients_image) propagate this ordering dependency to all caller resources.
  • Enforcement modes (controlled by binauthz_evaluation_mode):
ModeBehaviour
ALWAYS_ALLOW (default)Policy is enabled but all images are permitted — safe for initial rollout
REQUIRE_ATTESTATIONOnly images with a valid attestation from the signing key are admitted
ALWAYS_DENYAll image deployments are blocked (lockdown mode)
  • CI/CD integration: Subsequent builds via Cloud Build sign images at build time, maintaining a continuous chain of attestation.
  • Key variables: enable_binary_authorization, binauthz_evaluation_mode

D. VPC Service Controls

When enable_vpc_sc = true, the module provisions a full VPC-SC perimeter around the project's GCP APIs via module.vpc_sc (App_Common/modules/app_vpc_sc) — Services_GCP is not required. Two additional VPC-SC-friendly behaviours are present unconditionally:

  1. Cloud SQL Auth Proxy image mirroring (always active when a SQL server and Artifact Registry exist): The proxy image (gcr.io/cloud-sql-connectors/cloud-sql-proxy:${cloud_sql_proxy_version}) is mirrored into the project's own Artifact Registry so GKE nodes never pull from gcr.io (an out-of-project registry blocked by a VPC-SC perimeter). The tag is configurable via cloud_sql_proxy_version (default "2-alpine").
  2. VPC-SC compliant NetworkPolicy egress (always present when enable_network_segmentation = true): Egress rules include 199.36.153.4/30 (restricted.googleapis.com) and 199.36.153.8/30 (private.googleapis.com) — the recommended Private Google Access endpoints for GCP APIs inside a VPC-SC perimeter.

Perimeter provisioning (when enable_vpc_sc = true):

  • Organization auto-discovery: Reads the project's org_id; falls back to var.organization_id for folder-nested projects. Standalone projects (e.g. Qwiklab) emit a warning and skip the perimeter automatically.

  • Access levels (four, all suffixed with deployment_id): VPC network access (from vpc_cidr_ranges or auto-discovered from the discovered network), admin IP ranges (admin_ip_ranges), IAP service agent, and CI/CD (Cloud Build SA + resource_creator_identity).

  • Perimeter: PERIMETER_TYPE_REGULAR restricting Cloud Run, GKE, Cloud SQL, Secret Manager, Storage, Artifact Registry, Cloud Build, Certificate Manager, IAP, Compute, KMS, Pub/Sub, Redis, Filestore, and Firestore.

  • Dry-run mode: vpc_sc_dry_run = true (default) logs violations without enforcing them. Set to false to actively block out-of-perimeter API calls after validating the dry-run logs.

  • Auto-skip: No resources are created when the project has no organization, when it is folder-nested without an explicit organization_id, or when admin_ip_ranges is empty (lockout protection). Each case emits a null_resource warning with remediation guidance.

  • Key variables: enable_vpc_sc, vpc_sc_dry_run, vpc_cidr_ranges, organization_id, admin_ip_ranges

E. Secrets Store CSI Driver

Secrets are always fetched from Secret Manager and mounted into pods at start time by the GKE-managed Secrets Store CSI Driver — secret values are never written to Terraform state.

  • Driver name: The GKE managed addon uses the driver name secrets-store-gke.csi.k8s.io with provider "gke". This is distinct from the community secrets-store.csi.k8s.io driver — using the wrong driver name causes CSIDriver not found errors at pod creation.
  • Cluster enablement: Services GCP enables the addon via null_resource.configure_gke_addons using gcloud container clusters update --enable-secret-manager. For inline clusters (no Services GCP), App GKE runs the same gcloud command in null_resource.enable_secret_manager_addon. A separate null_resource.enable_secret_sync_addon then runs gcloud beta container clusters update --enable-secret-sync (requires GKE 1.33+) to install the Secret Manager Sync controller that materialises native Kubernetes Secrets from CSI-mounted files.
  • 120 s propagation wait: time_sleep.wait_for_secret_manager_addon blocks for 120 seconds after the addon is enabled so the CSI driver DaemonSet is ready on every Autopilot node before SecretProviderClass resources are applied.
  • Plan-time CRD guard: data.external.csi_driver_crd_installed probes the live cluster using kubectl get crd secretproviderclasses.secrets-store.csi.x-k8s.io. When the CRD is absent, SecretProviderClass and SecretSync resources are excluded from the plan (count = csi_driver_crd_ready && sm_backed_secret_count > 0 ? 1 : 0). A second apply materialises them.
  • sm_backed_secret_count: Counts secrets that must be backed by the CSI driver (i.e. not preset "explicit" values). The filter uses key-based exclusion (!contains(keys(local.preset_explicit_secret_values), k)) rather than value comparison (v != "explicit"). This is required because the DB_PASSWORD value flows through random_id.wrapper_deployment.hex — an attribute unknown at plan time — making any value-comparison expression unknown, which breaks count/for_each. Filtering by key is always plan-time-safe.
  • SecretProviderClass (base + per stage): A base SecretProviderClass is created in the application namespace, plus one per Cloud Deploy stage (${service_name}-${stage}-secrets). Each maps every entry in secret_environment_variables to a Secret Manager secret version with files written to the CSI volume.
  • Native Kubernetes Secrets via SecretSync: The GKE managed addon does not support the secretObjects field in SecretProviderClass. Instead, a SecretSync resource (secrets-store.csi.x-k8s.io/v1alpha1) is created alongside each SPC. The SecretSync controller watches the CSI-mounted files and syncs their content into a native Kubernetes Secret. Pods reference the standard K8s Secret via envFrom / valueFrom as normal.
  • Drift suppression: In google provider ≥ 7.x, secret_manager_config moved from addons_config to a top-level cluster block. The inline cluster resource now declares secret_manager_config { enabled = true } explicitly, so it is no longer in ignore_changes. addons_config remains in ignore_changes to suppress drift on legacy clusters provisioned with an older provider version. The overall effect is that spurious empty-mask PATCH requests (rejected by GKE with 400 Must specify a field to update) are avoided.
  • CSI driver readiness check: null_resource.wait_for_csi_driver_registered runs on every apply (via always_run = timestamp() trigger) and blocks until the CSIDriver object (secrets-store-gke.csi.k8s.io or the community variant) is registered and all related DaemonSets in kube-system have rolled out, ensuring no workload mounts a secrets-store volume before the kubelet has registered the driver.

F. Audit Logging

When enable_audit_logging = true, the module enables Cloud Audit Logs beyond the default ADMIN_WRITE:

  • google_project_iam_audit_config.all_services: Captures ADMIN_READ, DATA_READ, and DATA_WRITE for all GCP services used by the project.
  • Per-service overrides: Secret Manager and Cloud KMS receive explicit DATA_READ / DATA_WRITE configs so sensitive secret and key access is always logged even if allServices is tuned down elsewhere.
  • Equivalent to Services GCP enable_audit_logging = true — safe to enable at any point; does not affect running workloads. Increases Cloud Logging storage costs.
  • Key variable: enable_audit_logging (default false)

5. Traffic & Ingress

A. GKE Gateway API

Activated when enable_custom_domain = true or enable_cdn = true (use_gateway flag). Replaces the direct LoadBalancer service for external traffic.

  • Gateway class: gke-l7-global-external-managed — a Google-managed Global External L7 load balancer.
  • TLS termination: Certificate Manager certificates are provisioned per domain in application_domains and associated via a Certificate Map on the Gateway.
  • HTTPRoute: Routes traffic from the Gateway to the application Kubernetes Service.
  • GCPBackendPolicy: A GKE-native CRD (networking.gke.io/v1) attached to the Service that configures IAP, Cloud Armor security policy, and backend timeout in a single resource.
  • Static IP: When reserve_static_ip = true (default), a Global Static IP is reserved and attached to the Gateway. static_ip_name allows a custom name.
  • Cloud Deploy Gateway backend (gateway_backend_stage): When enable_cloud_deploy = true, the HTTPRoute backend must target a single stage's Service. The gateway_backend_stage variable (default "dev") controls which stage's Service and namespace the HTTPRoute points to. Change this to route external traffic to staging or prod once those stages are promoted. When enable_cloud_deploy = false, this variable is ignored and the HTTPRoute targets the base application Service directly.
  • Namespace dependency: kubernetes_manifest.gateway, kubernetes_manifest.gateway_backend_reference_grant, and kubernetes_manifest.backend_policy all declare depends_on = [kubernetes_namespace_v1.app]. This ensures the namespace exists before Gateway API CRDs are applied — without this, the Kubernetes API server rejects the manifests with a "namespace not found" error on the first apply.
  • Key variables: enable_custom_domain, application_domains, reserve_static_ip, static_ip_name, gateway_backend_stage

B. Cloud CDN

Setting enable_cdn = true activates the Gateway API path (use_gateway = true), routing all external traffic through the gke-l7-global-external-managed load balancer and reserving a Global Static IP. This is a prerequisite for enabling CDN on the underlying backend service.

Important: GCPBackendPolicy (the GKE CRD used for IAP and Cloud Armor integration) does not support a CDN configuration field. Cloud CDN must be enabled directly on the backend service created by the L7 load balancer, outside of Kubernetes resource management — via the GCP Console or gcloud compute backend-services update --enable-cdn.

  • Key variable: enable_cdn

C. Static IP Reservation

The module supports two static IP reservation paths depending on whether the Gateway API is active:

  • Gateway path (enable_custom_domain = true or enable_cdn = true): A Global static IP (google_compute_global_address) is reserved and attached to the Gateway via addresses[].value. The address name is auto-generated from the resource prefix or set via static_ip_name.
  • Direct LoadBalancer path (default, Gateway not active): A Regional static IP (google_compute_address) is reserved and assigned to the kubernetes_service_v1 LoadBalancer. The service_url output uses this IP directly (e.g. http://<ip>).
  • When reserve_static_ip = false, the LoadBalancer receives an ephemeral IP assigned by GKE; the service_url output falls back to the internal cluster URL until the IP is known.
  • Key variables: reserve_static_ip (default: true), static_ip_name

6. CI/CD & Delivery

A. Cloud Build Triggers

When enable_cicd_trigger = true, the module creates a Cloud Build v2 trigger connected to a GitHub repository:

  • GitHub connection: Authenticated via a Personal Access Token (github_token) stored in Secret Manager, or via a GitHub App installation ID (github_app_installation_id).
  • Re-apply safety: On subsequent applies where github_token is not re-provided, the module detects the existing Secret Manager secret and preserves the trigger without destroying it.
  • Trigger config: Branch pattern, included/ignored file filters, substitution variables, and trigger name are all configurable via cicd_trigger_config.
  • Service account: The cloud_build_sa service account is granted roles/container.developer on the cluster and roles/iam.serviceAccountUser to act as the GKE workload SA.

B. Cloud Deploy Pipeline

When enable_cloud_deploy = true, the module generates a full multi-stage delivery pipeline using Google Cloud Deploy:

  • Stages: Defined via cloud_deploy_stages. Default pipeline is dev → staging → prod. Each stage has its own Kubernetes namespace, Service, and Deployment.
  • Skaffold manifests: The module generates skaffold.yaml and per-stage k8s-deployment-<stage>.yaml / k8s-service-<stage>.yaml manifests automatically. Job manifests are intentionally excluded — Kubernetes Jobs are immutable and are run exclusively by Terraform to avoid RELEASE_FAILED errors.
  • Stage isolation: Each stage gets its own namespace, Kubernetes ServiceAccount, Kubernetes Secret, and per-stage SecretProviderClass (${service_name}-${stage}-secrets). Terraform provisions these before Cloud Deploy runs so that no hook-based setup is required.
  • Stage initialization jobs: When sm_backed_secret_count > 0, stage_initialization_jobs mount the Secrets Store CSI volume against the stage-specific SecretProviderClass so the Kubernetes Secret referenced by each job's secretKeyRef is materialized at pod start time. Without this mount the CSI driver never runs and pods fail with CreateContainerConfigError: secret not found.
  • Rollout ownership: kubernetes_deployment_v1.app_cd (the placeholder Deployment Cloud Deploy targets) is created with wait_for_rollout = false so Terraform does not block waiting for pods that Cloud Deploy — not Terraform — is responsible for rolling out.
  • Approval gates: require_approval = true on a stage pauses promotion until a human approves. auto_promote = true triggers automatic promotion on success.
  • Environment variables: Per-stage env vars (e.g. DB_NAME, DB_USER) are substituted with stage-namespaced values in the generated manifests.
  • Key variables: enable_cloud_deploy, cloud_deploy_stages

C. Image Mirroring

The module mirrors container images from external registries into the project's own Artifact Registry before deployment, ensuring GKE nodes pull only from within-project sources.

  • Application image mirroring: When enable_image_mirroring = true (default), the container image specified in container_image is copied into Artifact Registry using Cloud Build before the Kubernetes workload is created. This is the default path for container_image_source = "prebuilt" images.
  • Cloud SQL Auth Proxy mirroring: The proxy image (gcr.io/cloud-sql-connectors/cloud-sql-proxy:2-alpine) is unconditionally mirrored whenever a Cloud SQL server exists and an Artifact Registry repository is available (sql_server_exists && artifact_repo_id != ""). This prevents GKE nodes from ever reaching gcr.io directly.
  • VPC-SC benefit: Both mirroring paths are essential in VPC Service Controls environments where egress to external container registries is blocked at the perimeter.
  • Key variable: enable_image_mirroring (default: true)

7. Reliability & Scheduling

A. Pod Disruption Budgets

The module creates a PodDisruptionBudget to prevent Kubernetes from evicting too many pods simultaneously during voluntary disruptions (node upgrades, cluster autoscaler activity, drain operations). This is particularly important on GKE Autopilot where node management is fully automated.

  • Standard deployment: One PDB is created in the base namespace.
  • Cloud Deploy: A separate PDB is created per stage namespace.
  • pdb_min_available: Accepts an absolute count ("1", "2") or a percentage ("50%"). For single-replica workloads, "1" keeps the pod alive during node drains. For multi-replica workloads, "50%" maintains half the fleet.
  • Key variables: enable_pod_disruption_budget (default: true), pdb_min_available (default: "1")

B. Topology Spread Constraints

When enable_topology_spread = true, two TopologySpreadConstraint blocks are added to every Deployment and StatefulSet:

ConstraintTopology keymaxSkewwhenUnsatisfiable
Zone spreadtopology.kubernetes.io/zone1DoNotSchedule if topology_spread_strict = true, otherwise ScheduleAnyway
Node spreadkubernetes.io/hostname1Always ScheduleAnyway
  • Zone spread ensures pods are distributed evenly across GCP availability zones. topology_spread_strict = true blocks scheduling if the zone skew constraint cannot be satisfied — useful for strict HA requirements. The default ScheduleAnyway permits scheduling even when the constraint is violated.
  • Node spread adds a best-effort distribution across individual nodes, providing additional isolation without blocking scheduling.
  • Both constraints use app + deployment label selectors to match only pods belonging to this specific deployment.
  • Combined with enable_pod_disruption_budget (§7.A), topology spread provides proactive distribution across failure domains and reactive protection during voluntary disruptions.
  • Key variables: enable_topology_spread (default: false), topology_spread_strict (default: false)

C. Resource Quotas

When enable_resource_quota = true, a ResourceQuota is applied to every namespace managed by this module (base namespace and all Cloud Deploy stage namespaces). This prevents a single application from consuming unbounded cluster resources on a shared GKE Autopilot cluster.

Configurable limits:

VariableKubernetes fieldExample
quota_cpu_requestsrequests.cpu"4" or "4000m"
quota_cpu_limitslimits.cpu"8"
quota_memory_requestsrequests.memory"4Gi"
quota_memory_limitslimits.memory"8Gi"
quota_max_podspods"20"
quota_max_servicesservices"10"
quota_max_pvcspersistentvolumeclaims"5"

Only fields with non-empty values are included in the quota spec; omitting a field leaves it unconstrained.

D. Auto Password Rotation

When enable_auto_password_rotation = true, the module wires a fully automated database password rotation workflow via App_Common/modules/app_secrets:

  • Mechanism: A Kubernetes CronJob (the rotator) and an Eventarc trigger are provisioned. The trigger fires on Secret Manager SECRET_VERSION_ADD events and executes the rotator job, which generates a new password, updates the Cloud SQL user, and writes the new version to Secret Manager.
  • Propagation delay: rotation_propagation_delay_sec (default: 90) controls how long the rotator waits before disabling the previous secret version, giving running pods time to pick up the new credentials without connection failures.
  • Networking: The rotator job is wired to the same VPC and subnet as the Cloud SQL instance (resolved from locally computed rotation_vpc_network / rotation_vpc_subnet — not exposed as input variables) to connect via private IP.
  • Rotation schedule: The frequency is controlled by secret_rotation_period (default: "2592000s" / 30 days), which configures a Pub/Sub notification on the Secret Manager secret.
  • Validation: enable_auto_password_rotation = true with database_type = "NONE" is rejected at plan time.
  • Key variables: enable_auto_password_rotation, rotation_propagation_delay_sec, secret_rotation_period

8. Integrations

A. Redis / Memorystore

When enable_redis = true (default), the module injects Redis connection details as environment variables into the main workload, all initialization jobs, and all CronJobs.

  • REDIS_HOST: Set to redis_host if provided; falls back to the NFS server's internal IP address when redis_host is left blank. This default is convenient for development setups where Redis runs on the same VM as the NFS server.
  • REDIS_PORT: Set to redis_port (default: "6379").
  • REDIS_AUTH: When redis_auth is non-empty, the value is injected as a sensitive environment variable and also stored in the Kubernetes Secret (explicit_secret_values path) so it is never logged in Terraform output.
  • Production use: Point redis_host at a Google Cloud Memorystore for Redis private IP and set redis_auth to the instance's auth string.
  • Key variables: enable_redis (default: true), redis_host, redis_port, redis_auth (sensitive)

Note: enable_redis defaults to true. Deployments that do not use Redis should explicitly set enable_redis = false to avoid injecting unused REDIS_HOST / REDIS_PORT env vars.

B. Backup Import

When enable_backup_import = true, the module runs a db-import Kubernetes Job on terraform apply that restores a database backup into the Cloud SQL instance before the application starts.

  • Sources: backup_source = "gcs" reads the backup file from the automatically created backups GCS bucket. backup_source = "gdrive" downloads from Google Drive.
  • File: backup_file specifies the filename within the source (default: "backup.sql"). For GCS, the file must already be present in the module-created backups bucket.
  • Formats: backup_format supports sql, tar, gz, tgz, tar.gz, zip, or auto (default: "sql"). With auto, the import script detects the format from the file extension.
  • Connectivity: The import job uses the Cloud SQL Auth Proxy sidecar (--quitquitquit) for database access and authenticates using the DB_PASSWORD / ROOT_PASSWORD secrets.
  • Timeout: The job allows up to 20 minutes (create = "20m") for large database restores.
  • Key variables: enable_backup_import, backup_source, backup_file, backup_format

C. Service Mesh (ASM via Fleet)

Controlled by configure_service_mesh. Behaviour differs based on whether the module owns the GKE cluster:

  • Inline cluster (prereq_needs_gke = true): The module registers the cluster with GKE Fleet Hub (google_gke_hub_membership), enables the servicemesh Fleet feature, and configures per-cluster membership with MANAGEMENT_AUTOMATIC. Required APIs (gkehub.googleapis.com, mesh.googleapis.com, meshconfig.googleapis.com) are enabled automatically.
  • External cluster (Services GCP-provisioned): The namespace label "istio.io/rev" = "asm-managed" is applied, which signals to an already-running ASM control plane that it should inject sidecars into pods in this namespace. Full ASM control-plane management is assumed to be handled by Services GCP.
  • Key variable: configure_service_mesh (default: false)

D. Multi-Cluster Services (MCS)

Note: The enable_multi_cluster_service variable is not currently wired to any deployment resource in this module. Setting it to true has no effect on deployment. The feature is documented here for reference and future implementation.

When implemented, setting enable_multi_cluster_service = true would enable GKE Multi-Cluster Services for the application namespace by creating a Kubernetes ServiceExport resource. MCS allows Kubernetes Services to be exported from one cluster and consumed by other clusters registered in the same GKE Fleet, enabling cross-cluster service discovery without requiring an external load balancer or custom DNS.

  • Prerequisites: All participating clusters must be registered with the same GKE Fleet. Fleet registration is managed by Services GCP or by the inline ASM provisioning path in this module (§8.C).
  • DNS: Exported services become reachable at <service>.<namespace>.svc.clusterset.local from any cluster in the fleet.
  • Use case: Multi-region deployments where a service in one region must call a service in another region using a stable internal name, without traversing a public load balancer.
  • Key variable: enable_multi_cluster_service (accepted but not referenced — no effect on current deployment)

9. Inline Infrastructure Provisioning

When no Services GCP dependency is detected, the module is fully self-contained and provisions its own infrastructure stack. This mode is activated when the discovery scripts find no existing VPC, GKE cluster, NFS server, or Cloud SQL instance in the project.

Resources provisioned inline:

ResourceTerraform resourceCondition
VPC network + subnetsgoogle_compute_network, google_compute_subnetworkNo Services GCP VPC found
Cloud Router + Cloud NATgoogle_compute_router, google_compute_router_natNo Services GCP VPC found
GKE Autopilot clustergoogle_container_cluster.inline_gkeprereq_needs_gke = true
NFS server (GCE MIG)google_compute_instance_group_manager + google_compute_instance_templateNo existing NFS server found (prereq_needs_nfs = true)
Cloud SQL — PostgreSQLgoogle_sql_database_instance.inline_postgresprereq_needs_postgres = true
Cloud SQL — MySQLgoogle_sql_database_instance.inline_mysqlprereq_needs_mysql = true
ASM via Fleet Hubgoogle_gke_hub_membership, google_gke_hub_featureconfigure_service_mesh = true and inline GKE
PSA connectionnull_resource.inline_psa (via gcloud services vpc-peerings connect)No Services GCP VPC found
PSA subnet-route exportnull_resource.inline_psa_subnet_routesNo Services GCP VPC found
Service Networking service agentgoogle_project_service_identity.servicenetworking_sa, google_project_iam_member.servicenetworking_service_agentNo Services GCP VPC found (PSA connection required)

Startup sequencing: After the GKE cluster reports RUNNING, a time_sleep of 90 seconds (wait_for_gke_api) ensures the Autopilot control plane is fully reachable before any Kubernetes resources are created. This prevents transient connection refused errors during initial cluster warm-up.

Inline NFS server: Deployed as a Managed Instance Group (MIG) rather than a standalone VM, providing auto-healing and rolling replacement. Includes:

  • Instance template (google_compute_instance_template.inline_nfs_server): Ubuntu 22.04 LTS, e2-small, 10 GB boot disk + 10 GB SSD data disk, tagged nfsserver and redisserver.
  • Snapshot schedule (google_compute_resource_policy.inline_nfs_snapshot): Daily snapshots retained for 7 days.
  • Static internal IP (google_compute_address.inline_nfs_ip): Ensures the NFS IP remains stable across MIG instance replacements.
  • Auto-healing: MIG health checks on ports 2049 (NFS) and 6379 (Redis). initial_delay_sec = 720 allows time for apt-get install on constrained VMs before the first health-check failure can trigger an auto-heal cycle.
  • Firewall rules: app-allow-nfs-* (TCP 111/2049 from subnet/pod CIDRs), app-allow-nfs-hc-* (TCP 2049 from GCP health-check probers 35.191.0.0/16, 130.211.0.0/22), app-allow-redis-* (TCP 6379), and app-allow-iap-ssh-* (TCP 22 from 35.235.240.0/20) for SSH-in-browser access.

PSA service agent: GCP auto-grants roles/servicenetworking.serviceAgent when servicenetworking.googleapis.com is enabled, but the grant is asynchronous and can lag beyond the 60 s API-enablement wait on fresh projects. The module explicitly provisions google_project_service_identity.servicenetworking_sa and then grants google_project_iam_member.servicenetworking_service_agent before null_resource.inline_psa (the PSA connection resource — see below) is created, eliminating the race condition.

prereq_sql_network_self_link: A local that resolves the correct VPC self-link for inline Cloud SQL Private Service Access. When Services GCP is deployed, its discovered network is used; otherwise the inline VPC is referenced. This prevents an "invalid index" error when deploying into a Services GCP project where google_compute_network.inline_vpc is never created.

Destroy ordering: kubernetes_namespace_v1.app declares an explicit depends_on on google_container_cluster.inline_gke. This reverses on destroy so Terraform deletes the namespace (and all Kubernetes resources within it) before deleting the cluster, avoiding connection refused errors when the Kubernetes provider loses its API endpoint mid-destroy.

PSA connection: The PSA connection to servicenetworking.googleapis.com is established via null_resource.inline_psa using gcloud services vpc-peerings connect. The old google_service_networking_connection resource was replaced because the Terraform provider issues a READ on every plan, which VPC-SC perimeters block. A removed block in the code drops any lingering state entry without destroying the actual peering. A second null_resource.inline_psa_subnet_routes then calls gcloud compute networks peerings update --export-subnet-routes-with-public-ip so GKE pod CIDR routes are advertised to the Cloud SQL producer VPC.

Drift suppression on google_container_cluster.inline_gke: The resource's lifecycle.ignore_changes list covers every block modified out-of-band by GKE or not yet in the provider schema:

BlockReason
datapath_providerDataplane V2 is selected at create time and cannot be mutated safely
addons_configRetained for legacy clusters; in google provider ≥ 7.x secret_manager_config is now a top-level block declared explicitly in the resource
cluster_autoscalingGKE fills in default NAP values after creation
master_authorized_networks_configCommonly adjusted by platform operators via gcloud
gateway_api_configAuto-enabled when the Gateway API is first used
vertical_pod_autoscalingAutopilot toggles VPA status based on usage
ip_allocation_policyGCP rewrites secondary range references after cluster bootstrap

Note: secret_manager_config is not in ignore_changes — it is now declared explicitly as a top-level block (secret_manager_config { enabled = true }) on the cluster resource, aligning Terraform desired state with the live cluster.

Override variables: To target an existing instance rather than provisioning inline, supply the explicit name variables:

VariableDefaultPurpose
sql_instance_name""Use a named Cloud SQL instance directly
sql_instance_base_name"app-sql"Base name for inline SQL instance
nfs_instance_name""Use a named NFS GCE VM directly
nfs_instance_base_name"app-nfs"Base name for inline NFS VM
prereq_gke_subnet_cidr"10.201.0.0/24"Subnet CIDR for inline GKE cluster

10. Variable Reference

Key variables grouped by functional area. All variables are defined in variables.tf.

Module Metadata (Platform)

These variables are consumed by the platform UI / billing system and do not affect deployed resources.

VariableTypeDefaultDescription
module_descriptionstring(long default)Human-readable description of the module's purpose
module_documentationstring"https://docs.radmodules.dev/docs/applications/gke-app"URL to external documentation
module_dependencylist(string)["Services GCP"]Other modules that must be deployed first
module_serviceslist(string)(long list)GCP services consumed by this module
credit_costnumber100Platform credits consumed on deployment
require_credit_purchasesbooltrueEnforce credit balance check before deployment
enable_purgebooltruePermit full deletion of resources on destroy
public_accessboolfalseMake the module publicly visible in the platform catalogue

Core Identity & Project

VariableTypeDefaultDescription
project_idstringrequiredGCP project ID for all resources
application_namestring"gkeapp"Internal app name; used in resource naming
application_display_namestring"App GKE Application"Human-readable name shown in UI and dashboards
application_descriptionstring"App GKE Custom Application..."Brief description of the application's purpose
application_versionstring"1.0.0"Version tag applied to the container image
tenant_deployment_idstring"demo"Deployment environment suffix (e.g. prod, dev)
deployment_idstring""Optional deployment ID; auto-generated if empty
support_userslist(string)[]Email addresses of monitoring alert recipients
resource_labelsmap(string){}Common labels applied to all resources
fallback_regionstring"us-central1"GCP region used when no Services GCP subnet mapping can be auto-discovered
impersonation_service_accountstring""Service account email to impersonate in discovery / mirror scripts (cross-project deployments)
resource_creator_identitystring"rad-module-creator@tec-rad-ui-2b65.iam.gserviceaccount.com"Service account used by Terraform to create resources
explicit_secret_valuesmap(string){}Raw secret values provided directly by a wrapper module; bypasses plan-time Secret Manager lookups
scripts_dirstring""Path to initialisation scripts directory; defaults to module's built-in scripts

Compute & Scaling (§3.A)

VariableTypeDefaultDescription
workload_typestringnullDeployment or StatefulSetnull falls through to the module preset
min_instance_countnumber1HPA minimum replicas
max_instance_countnumber3HPA maximum replicas
container_imagestring""Container image URI
container_image_sourcestring"custom"prebuilt (use container_image directly) or custom (build from source)
container_build_configobject{enabled=true}Cloud Build configuration for custom source (Dockerfile, context, build args, repo)
container_portnumber8080Port the container listens on
container_protocolstring"http1"HTTP protocol version: http1 or h2c (gRPC). Not referenced — the protocol is hardcoded to http1 internally; changing this value has no effect on deployment. Accepted for input validation only.
container_resourcesobject{cpu_limit="1000m", memory_limit="512Mi"}CPU/memory requests and limits; also accepts ephemeral_storage_limit/request
timeout_secondsnumber300Load balancer backend timeout in seconds (0–3600)
service_typestring"LoadBalancer"Kubernetes Service type
service_annotationsmap(string){}Custom annotations applied to the Kubernetes Service
service_labelsmap(string){}Custom labels applied to the Kubernetes Service
session_affinitystring"ClientIP"Kubernetes Service session affinity: None or ClientIP
termination_grace_period_secondsnumber60Pod termination grace period in seconds (0–3600)
enable_vertical_pod_autoscalingboolfalseEnable VPA
deploy_applicationbooltrueSet false to provision infrastructure only
deployment_timeoutnumber1800Seconds Terraform waits for rollout during apply (create/update/delete)

StatefulSet Configuration (§3.A)

VariableTypeDefaultDescription
stateful_pvc_enabledboolnullEnable PVC provisioning for StatefulSet workloads
stateful_pvc_sizestringnullPVC size (e.g. "20Gi")
stateful_pvc_mount_pathstringnullMount path inside container (e.g. "/var/lib/data")
stateful_pvc_storage_classstringnullKubernetes StorageClass (e.g. "standard-rwo")
stateful_headless_serviceboolnullCreate a headless Service for stable pod DNS
stateful_pod_management_policystringnullOrderedReady or Parallel
stateful_update_strategystringnullRollingUpdate or OnDelete
stateful_fs_groupnumbernullGID set as pod-level fsGroup in the security context; Kubernetes chowns PVC mount to this GID on attach

Database (§3.B)

VariableTypeDefaultDescription
database_typestring"POSTGRES"POSTGRES, MYSQL, SQLSERVER, or NONE
application_database_namestring"gkeappdb"Cloud SQL database name
application_database_userstring"gkeappuser"Cloud SQL user
enable_cloudsql_volumebooltrueInject Cloud SQL Auth Proxy sidecar
cloudsql_volume_mount_pathstring"/cloudsql"Filesystem path inside the container where the Cloud SQL Auth Proxy Unix socket is mounted
cloud_sql_proxy_versionstring"2-alpine"Tag for the Cloud SQL Auth Proxy sidecar image (mirrored into Artifact Registry)
sql_instance_namestring""Target an existing Cloud SQL instance by name
database_password_lengthnumber32Length of the generated DB password (16–64)
db_password_env_var_namestring""Additional env var name alongside DB_PASSWORD for apps that expect a non-standard name (e.g. "WORDPRESS_DB_PASSWORD")
enable_postgres_extensionsboolfalseEnable installation of PostgreSQL extensions after DB provisioning. Not referenced in deployment resources — used for input validation only when deploying App GKE standalone. The extension flag and list flow from the application module configuration when called from a wrapper module.
postgres_extensionslist(string)[]PostgreSQL extensions to install (e.g. ['postgis', 'uuid-ossp']). Not referenced directly — the extension list is derived from the application module configuration (local.selected_module.postgres_extensions), not this variable. Setting it when deploying standalone has no effect.
enable_mysql_pluginsboolfalseEnable installation of MySQL plugins after DB provisioning. Not referenced in deployment resources — used for input validation only when deploying App GKE standalone. Plugin configuration flows from the application module when called from a wrapper module.
mysql_pluginslist(string)[]MySQL plugins to install (e.g. ['audit_log']). Not referenced directly — the plugin list is derived from the application module configuration (local.selected_module.mysql_plugins). Setting it when deploying standalone has no effect.

Storage (§3.C)

VariableTypeDefaultDescription
enable_nfsbooltrueEnable NFS (Filestore) mount
nfs_mount_pathstring"/mnt/nfs"Mount path inside container
nfs_volume_namestring"nfs-data-volume"Kubernetes Volume name for the NFS mount — override when mounting a second NFS share alongside the first
create_cloud_storagebooltrueCreate GCS buckets
storage_bucketslist(object)[]Bucket configurations
gcs_volumeslist(object)[]GCS Fuse volume mounts
manage_storage_kms_iamboolfalseCreate a CMEK KMS keyring and GCS encryption key, grant the GCS service account encrypter/decrypter, and enable CMEK on all storage buckets
enable_artifact_registry_cmekboolfalseCreate an Artifact Registry KMS key in the project CMEK keyring and enable at-rest encryption of container images

Networking (§3.D)

VariableTypeDefaultDescription
gke_cluster_selection_modestring"primary"primary, explicit, or round-robin
gke_cluster_namestring""Explicit cluster name (for explicit mode)
network_namestring""Name of the VPC network to use. Leave empty to auto-discover a single Services GCP-managed network
network_tagslist(string)["nfsserver"]Network tags applied to GKE nodes and pods for firewall targeting
enable_network_segmentationboolfalseEnable Kubernetes NetworkPolicies (also sets ADVANCED_DATAPATH on inline clusters)
namespace_namestring""Kubernetes namespace (auto-generated if empty)
prereq_gke_subnet_cidrstring"10.201.0.0/24"Subnet CIDR for the GKE subnet when a Services GCP VPC exists but no GKE cluster is present
prereq_subnet_cidr_overridestring""Override for the inline VPC primary subnet CIDR (pin to previously-applied value to avoid replacement)
prereq_gke_pod_cidr_overridestring""Override for the inline GKE pod secondary range CIDR
prereq_gke_service_cidr_overridestring""Override for the inline GKE service secondary range CIDR

Advanced Security (§4)

VariableTypeDefaultDescription
enable_cloud_armorboolfalseCreate inline Cloud Armor WAF policy
admin_ip_rangeslist(string)[]IP ranges that bypass WAF rules
cloud_armor_policy_namestring"default-waf-policy"External policy name (when not inline)
enable_iapboolfalseEnable Identity-Aware Proxy
iap_oauth_client_idstring""OAuth 2.0 client ID (sensitive)
iap_oauth_client_secretstring""OAuth 2.0 client secret (sensitive)
iap_authorized_userslist(string)[]IAP user allowlist
iap_authorized_groupslist(string)[]IAP group allowlist
enable_binary_authorizationboolfalseEnforce Binary Authorization
binauthz_evaluation_modestring"ALWAYS_ALLOW"ALWAYS_ALLOW, REQUIRE_ATTESTATION, or ALWAYS_DENY
enable_vpc_scboolfalseProvision a VPC Service Controls perimeter (see §4.D)
vpc_sc_dry_runbooltrueLog perimeter violations without blocking (flip to false to enforce)
vpc_cidr_rangeslist(string)[]Explicit VPC CIDR ranges for the perimeter access level (auto-discovered when empty)
organization_idstring""GCP Organization ID — required for folder-nested projects
enable_audit_loggingboolfalseEnable ADMIN_READ/DATA_READ/DATA_WRITE audit logs plus Secret Manager / KMS overrides

Traffic & Ingress (§5)

VariableTypeDefaultDescription
enable_custom_domainboolfalseEnable Gateway API + Certificate Manager
application_domainslist(string)[]Hostnames for TLS certificates
enable_cdnboolfalseActivate Gateway API path for CDN
reserve_static_ipbooltrueReserve a static IP address
static_ip_namestring""Custom name for the static IP

CI/CD & Delivery (§6)

VariableTypeDefaultDescription
enable_cicd_triggerboolfalseCreate Cloud Build trigger
github_repository_urlstring""GitHub repository URL
github_tokenstring""GitHub PAT for repository access (sensitive)
github_app_installation_idstring""GitHub App installation ID (preferred for org-level repos; token used only as authorizer credential)
cicd_trigger_configobject{branch_pattern="^main$"}Branch, file filters, substitutions
enable_cloud_deployboolfalseEnable Cloud Deploy pipeline
cloud_deploy_stageslist(object)[dev, staging, prod]Pipeline stage definitions
enable_image_mirroringbooltrueMirror container images to Artifact Registry

Reliability & Scheduling (§7)

VariableTypeDefaultDescription
enable_pod_disruption_budgetbooltrueCreate PodDisruptionBudget
pdb_min_availablestring"1"Minimum available pods (integer or percentage)
enable_topology_spreadboolfalseAdd zone and node spread constraints
topology_spread_strictboolfalseUse DoNotSchedule for zone constraint
enable_resource_quotaboolfalseApply ResourceQuota to all namespaces
quota_cpu_requestsstring""Total CPU requests limit
quota_cpu_limitsstring""Total CPU limits cap
quota_memory_requestsstring""Total memory requests limit
quota_memory_limitsstring""Total memory limits cap
quota_max_podsstring""Maximum pod count
enable_auto_password_rotationboolfalseEnable automated DB password rotation
rotation_propagation_delay_secnumber90Seconds to wait before disabling old secret version
secret_rotation_periodstring"2592000s"Rotation frequency (default: 30 days)

Environment Variables & Secrets

VariableTypeDefaultDescription
environment_variablesmap(string){}Plain-text environment variables injected into the GKE pod
secret_environment_variablesmap(string){}Secret Manager secret references injected as env vars via Kubernetes valueFrom
secret_propagation_delaynumber30Seconds to wait after secret creation before proceeding (allows global replication)

Observability & Health

VariableTypeDefaultDescription
health_check_configobject{enabled=true, path="/healthz"}Kubernetes liveness probe configuration
startup_probe_configobject{enabled=true, path="/healthz"}Kubernetes startup probe configuration
uptime_check_configobject{enabled=true, path="/"}Cloud Monitoring uptime check configuration
alert_policieslist(object)[]Cloud Monitoring alert policies that notify support_users when thresholds are exceeded

Backup & Maintenance

VariableTypeDefaultDescription
backup_schedulestring"0 2 * * *"Cron expression for automated database backup job
backup_retention_daysnumber7Days to retain backup files in the GCS backup bucket
enable_custom_sql_scriptsboolfalseRun custom SQL scripts from GCS after DB provisioning
custom_sql_scripts_bucketstring""GCS bucket containing the SQL scripts
custom_sql_scripts_pathstring""Path prefix within the bucket for SQL scripts
custom_sql_scripts_use_rootboolfalseExecute custom SQL scripts as the root DB user

Integrations (§8)

VariableTypeDefaultDescription
enable_redisbooltrueInject REDIS_HOST / REDIS_PORT env vars
redis_hoststring""Redis host (defaults to NFS server IP if blank)
redis_portstring"6379"Redis port
redis_authstring""Redis auth string (sensitive)
enable_backup_importboolfalseRun DB import job on apply
backup_sourcestring"gcs"gcs or gdrive
backup_filestring"backup.sql"Filename of backup to import
backup_formatstring"sql"sql, tar, gz, tgz, tar.gz, zip, or auto
configure_service_meshboolfalseEnable ASM / Fleet Hub integration
enable_multi_cluster_serviceboolfalseEnable GKE Multi-Cluster Services — not referenced in current version; has no effect

11. Outputs

Key outputs exported by the module. Full list in outputs.tf.

Service

OutputDescription
service_nameKubernetes Service name
namespaceKubernetes namespace
service_urlExternal URL (static IP / custom domain) or internal cluster URL
service_external_ipExternal LoadBalancer IP (if static IP is reserved)
service_cluster_ipClusterIP of the base Kubernetes Service
stage_service_cluster_ipsMap of ClusterIPs for stage-specific Services (when enable_cloud_deploy = true)
kubernetes_readytrue when the GKE cluster endpoint is available and all Kubernetes resources are deployed. false on the first apply of an inline cluster — the pipeline must run a second apply to deploy application resources.

Database

OutputDescription
database_instance_nameCloud SQL instance name
database_nameApplication database name
database_userApplication database user
database_hostDatabase host (127.0.0.1 via Cloud SQL Auth Proxy)
database_portDatabase port
database_password_secretSecret Manager secret name for the DB password

Infrastructure

OutputDescription
network_nameVPC network name
network_existsWhether the VPC network exists
regionsAvailable regions in the VPC
nfs_server_ipNFS server internal IP (sensitive)
nfs_mount_pathNFS mount path in containers
nfs_share_pathNFS share path on the server
storage_bucketsMap of created GCS bucket names
container_imageResolved container image URI used for the deployment
container_registryArtifact Registry repository name (when custom build is active)
deployment_idUnique deployment identifier (random hex suffix)
tenant_idTenant identifier
resource_prefixResource naming prefix
project_idGCP project ID
project_numberGCP project number
deployment_summarySummary map of key deployment attributes

Monitoring

OutputDescription
monitoring_enabledWhether monitoring is configured
monitoring_notification_channelsCloud Monitoring notification channel names

CI/CD

OutputDescription
cicd_enabledWhether the CI/CD pipeline is active
cloudbuild_trigger_nameCloud Build trigger name
cloudbuild_trigger_idCloud Build trigger ID
github_repository_urlGitHub repository URL connected for CI/CD
github_repository_ownerGitHub repository owner/organisation
github_repository_nameGitHub repository name
artifact_registry_repositoryArtifact Registry repository details (name, location, url)
cicd_configurationComplete CI/CD config map (trigger, repo, SA, image URL)

Security

OutputDescription
vpc_sc_enabledWhether the VPC-SC perimeter was successfully created
vpc_sc_perimeter_nameVPC-SC service perimeter resource name
vpc_sc_dry_run_modetrue if VPC-SC is in dry-run mode
audit_logging_enabledWhether project-level Cloud Audit Logs are enabled
artifact_registry_cmek_enabledWhether Artifact Registry CMEK encryption is configured

Jobs

OutputDescription
initialization_jobsMap of created initialization job names
cron_jobsMap of created cron job names
nfs_setup_jobNFS setup job name
db_import_jobDatabase import job name
statefulset_nameStatefulSet name (when workload_type = "StatefulSet")

12. Sensible Defaults & Configuration Consequences Reference

This section consolidates sensible starting values and the consequences of misconfiguration for every major variable group. Risk levels: Critical (data loss, full outage, or security breach), High (service unavailability or significant degradation), Medium (degraded functionality or increased cost), Low (minor operational impact).


Identity & Naming

VariableSensible DefaultRiskConsequence of Incorrect Value
tenant_deployment_id"prod" / "staging" / "dev" (match environment)CriticalDo not change after initial deployment. The value is embedded in every resource name (Cloud SQL instance, GCS buckets, secrets, service accounts). Changing it causes Terraform to destroy and recreate all named resources, resulting in data loss and a new, empty database.
application_nameShort, lowercase, hyphen-safe identifier (e.g. "myapp")CriticalDo not change after initial deployment. Embedded in Kubernetes namespace, service names, and GCP resource names. Renaming orphans existing resources and creates new ones with an empty state.
fallback_regionThe GCP region where your workloads run (e.g. "europe-west1")HighIf left as "us-central1" when your Services GCP stack is in a different region, inline infrastructure (NFS VM, Cloud SQL instance, GKE cluster) is provisioned in the wrong region. Cross-region latency increases significantly; costs rise due to cross-region egress.
deployment_id"" (auto-generate on first apply; never change)CriticalAuto-generation is safe. If you set a value manually and then change it, every resource whose name includes the deployment ID is destroyed and recreated, causing complete data loss.
support_usersList of on-call email addressesMediumEmpty list suppresses all Cloud Monitoring alert emails. Outages go unnotified until a user reports them.

Compute & Scaling

VariableSensible DefaultRiskConsequence of Incorrect Value
workload_type"Deployment" for stateless apps; "StatefulSet" for apps that require a stable filesystem identity (databases, Elasticsearch)HighUsing "Deployment" for a workload that requires persistent volume identity causes data inconsistency — pods get different PVCs on restart. Using "StatefulSet" for a stateless app adds unnecessary scheduling overhead.
min_instance_count1 for production (eliminates cold starts); 0 for dev/batch workloadsMediumSetting 0 in production causes cold-start delays (pod scheduling + image pull can take 30–90 s). Setting very high values (e.g. 10) wastes cluster resources and increases costs when traffic is low.
max_instance_count3 for most apps; size to ≤ Cloud SQL max_connections / connections_per_podHighToo low: limits throughput under load — requests queue and latency spikes. Too high: exhausts Cloud SQL connection limits (default 25 per db-user on shared-core instances), causing FATAL: sorry, too many clients already errors for all pods.
container_portMust match what the application actually listens on (e.g. 8080)CriticalPort mismatch causes Kubernetes liveness probes to fail immediately. All traffic is rejected with a connection refused error. The Deployment never becomes Ready and the pod restarts indefinitely.
container_resources.cpu_limit"1000m" to start; profile under load and adjustMediumToo low (e.g. "100m"): CPU throttling causes high request latency. Application may timeout. Too high: wastes GKE Autopilot billing units (Autopilot charges per requested resource, not per used).
container_resources.memory_limit"512Mi" minimum; increase if app uses in-memory cachingHighToo low: pod is OOMKilled (exit code 137), causing a restart loop. Kubernetes back-off extends restart intervals up to 5 minutes, causing extended outages. Too high: GKE Autopilot charges for the full requested amount.
deployment_timeout1800 (default); increase to 3600 for large images or apps with slow startupMediumToo low: Terraform times out waiting for rollout and marks the apply as failed even though the pod eventually starts. Confusing error messages. Too high: a genuinely broken deployment (e.g. wrong image tag) takes longer to surface as a failure.
termination_grace_period_seconds60 (default); increase for apps that need to drain connections (e.g. 120)MediumToo low: in-flight requests are forcibly terminated (SIGKILL) before the app can finish serving them. Results in HTTP 502 errors visible to users during rolling updates.
enable_vertical_pod_autoscalingfalse by default; enable only after load testingMediumVPA may set resource requests above what the app actually uses, increasing costs, or below what it needs, causing OOMKills. Do not enable with HPA on the same metric — the combination causes scaling oscillation.

Database

VariableSensible DefaultRiskConsequence of Incorrect Value
database_type"POSTGRES" for most apps; "NONE" if the app uses an external databaseCriticalChanging database_type after initial deployment replaces the Cloud SQL instance. All data in the existing database is lost. Never change this after the first apply unless you have a validated backup and a restore plan.
enable_cloudsql_volumetrue (use Cloud SQL Auth Proxy — the secure, recommended path)HighSetting false switches to direct TCP. The pod must reach the Cloud SQL instance's private IP directly. If firewall rules do not permit this, database connections fail silently. Auth Proxy IAM authentication is also lost, requiring password-only auth.
cloudsql_volume_mount_path"/cloudsql" (default; only change if the application framework requires a different path)CriticalIf changed without updating the application's database connection string, the app cannot find the Auth Proxy Unix socket. All database connections fail with no such file or directory or connection refused.
application_database_nameA short lowercase name tied to the app (e.g. "myappdb")CriticalDo not change after initial deployment. Renaming the database requires manual data migration. Terraform will create a new empty database with the new name, and the application will start against an empty schema, causing runtime errors.
application_database_userA short lowercase name (e.g. "myappuser")HighDo not change after initial deployment. A new database user is created; the old user retains its grants. The new user starts with no privileges until db-init is re-run. Application connections may fail until grants are applied.
database_password_length32 (minimum recommended for production)MediumValues below 16 are rejected by validation. Values between 1631 are weaker than recommended. Values above 64 are rejected by validation.
cloud_sql_proxy_version"2-alpine" (current stable)MediumUsing an old tag may expose known CVEs. Using a non-existent tag causes the mirroring job to fail, blocking the deployment.
enable_auto_password_rotationfalse initially; enable once the rotation pipeline is validatedHighEnabling with database_type = "NONE" is rejected at plan time. Enabling with too short a rotation_propagation_delay_sec can cause connection failures during rotation if pods pick up the old credential after it is disabled.
rotation_propagation_delay_sec90 (default); increase to 120180 for high-concurrency appsHighToo short: the old secret version is disabled before all running pods have restarted with the new credential. Results in authentication failures and HTTP 500 errors until the pod restarts again.

Storage

VariableSensible DefaultRiskConsequence of Incorrect Value
enable_nfstrue for apps with shared file uploads; false for fully stateless appsMediumEnabling NFS for a stateless app adds latency and unnecessary cost (NFS VM or Filestore). Disabling NFS for an app that writes shared files causes data inconsistency — each pod has its own ephemeral filesystem, and files written by one pod are not visible to others.
nfs_mount_path"/mnt/nfs" (default); must match where the application expects its shared storageHighWrong path: application writes files to the ephemeral container filesystem instead of NFS. Files survive only as long as the pod runs. After restart or scaling, previously uploaded files are not found.
nfs_volume_name"nfs-data-volume" (default; only change when using two NFS mounts)MediumChanging after initial deployment: Kubernetes treats this as a volume removal + add, causing the Deployment to rollout with a brief disruption. Always keep the same value once set.
create_cloud_storagetrue unless buckets are managed externallyMediumSetting false skips bucket provisioning. Any application code that writes to the expected bucket names will get 403 Forbidden or 404 Not Found errors if the buckets were not created externally.
manage_storage_kms_iamfalse initially; enable when CMEK is a compliance requirementMediumEnabling without the ${project_id}-cmek-keyring existing: the module auto-creates it. Disabling after buckets were encrypted: new objects cannot be written (key access revoked), existing objects cannot be read.

Networking

VariableSensible DefaultRiskConsequence of Incorrect Value
gke_cluster_selection_mode"primary" (auto-discover the single Services GCP cluster)HighUsing "explicit" without a valid gke_cluster_name causes the provider to fail at plan time with a cluster-not-found error. Using "round-robin" without multiple clusters in the same network deploys to a single cluster anyway (not an error, but misleading).
gke_cluster_name"" (auto-discover); only set when gke_cluster_selection_mode = "explicit"HighWrong cluster name in explicit mode: Terraform targets the wrong cluster, deploying the application to the wrong environment. All Kubernetes resources are created on the wrong cluster.
enable_network_segmentationfalse (start without segmentation; enable after validating NetworkPolicies)HighEnabling without understanding inter-namespace traffic patterns blocks legitimate pod-to-pod communication. Symptoms: database connections timeout, initialization jobs hang, health checks fail. Debug with kubectl describe networkpolicy -n NAMESPACE.
prereq_gke_subnet_cidr"10.201.0.0/24" (default; only relevant when inline GKE provisioning is needed)CriticalDo not change after the inline GKE cluster has been created. Changing the CIDR causes the existing subnet to be destroyed and recreated, which requires the GKE cluster to be recreated — resulting in complete data loss of all workloads on that cluster.
prereq_subnet_cidr_override"" (auto-assign; pin to the applied value if re-applying after initial deploy)CriticalIf left blank on subsequent applies when the inline VPC subnet has been created, Terraform may attempt to replace the subnet with a new auto-assigned CIDR. This destroys and recreates the GKE cluster. Always pin to the CIDR from the first apply's output.

Advanced Security

VariableSensible DefaultRiskConsequence of Incorrect Value
enable_cloud_armorfalse for internal/dev; true for production internet-facing servicesHighLeaving false in production: no WAF protection against SQLi, XSS, LFI, RCE, or DDoS. Enabling mid-deployment adds a new Global Load Balancer — ensure admin_ip_ranges includes your CI/CD and admin IPs before enabling, or valid traffic may be blocked.
admin_ip_rangesYour office VPN CIDR + CI/CD egress IPs (e.g. ["203.0.113.0/24"])HighEmpty with enable_cloud_armor = true: no bypass rule is created. If your own IP accidentally matches a WAF rule (e.g. during pentest), you will be denied (403). With enable_vpc_sc = true, an empty list is an auto-skip condition — VPC-SC perimeter is not created and a warning is emitted instead of an error.
enable_iapfalse (start open; enable IAP for internal-only or admin interfaces)HighEnabling IAP without setting iap_oauth_client_id and iap_oauth_client_secret: deployment fails. Enabling without adding authorised users/groups: no one can access the app — all requests return HTTP 403.
binauthz_evaluation_mode"ALWAYS_ALLOW" initially; promote to "REQUIRE_ATTESTATION" only after CI pipeline produces valid attestationsCriticalSetting "REQUIRE_ATTESTATION" before the CI/CD pipeline attests images: all pod deployments are blocked. The error Image is not attested appears in pod events. Setting "ALWAYS_DENY" blocks all image deployments immediately, including emergency rollbacks.
enable_vpc_scfalse initially; only enable when vpc_sc_dry_run = true has been validated with zero violationsCriticalSetting enable_vpc_sc = true with vpc_sc_dry_run = false without validating: GCP API calls from outside the perimeter are blocked. Pod pulls from external registries fail; Secret Manager calls from unapproved IPs fail. Complete service outage until perimeter is updated.
vpc_sc_dry_runtrue always until you have read and validated the Cloud Audit Logs for violationsCriticalSetting false on first enable enforces the perimeter immediately. Any unexpected traffic pattern (e.g. Cloud Build SA not in access level) results in API call denial and deployment failures.
enable_audit_loggingfalse for dev; true for productionLowEnabling increases Cloud Logging ingestion volume and cost. Disabling in production means secret access and key usage are not logged — a compliance risk for regulated environments (SOC 2, PCI-DSS, HIPAA).

Traffic & Ingress

VariableSensible DefaultRiskConsequence of Incorrect Value
enable_custom_domainfalse until DNS is configured; then trueMediumEnabling before DNS records point to the static IP: SSL certificate provisioning fails. Certificate Manager will continuously retry and the domain will be inaccessible via HTTPS until DNS is correctly pointed at the Gateway IP.
application_domainsYour validated domain names (e.g. ["app.example.com"])HighTypos in domain names cause SSL certificate provisioning for the wrong domain. The correct domain has no certificate and HTTPS traffic fails. Certificates for misspelled domains are wasted but cannot be automatically cleaned up.
reserve_static_iptrue (default; strongly recommended)MediumSetting false: Cloud Run/GKE receives an ephemeral IP. DNS A records cannot be reliably set. The IP changes on each deployment, breaking bookmarks, allowlists, and cached DNS records.
gateway_backend_stage"dev" initially; change to "staging" or "prod" as pipeline maturesHighSetting to a stage that does not yet exist (e.g. "prod" before prod is promoted to): the HTTPRoute backend has no valid endpoint. All external traffic receives HTTP 502 or 404.

CI/CD & Delivery

VariableSensible DefaultRiskConsequence of Incorrect Value
enable_image_mirroringtrue (default; strongly recommended)HighSetting false: GKE nodes pull directly from external registries (Docker Hub, GitHub Container Registry). Affected by pull rate limits (429 errors during deployments). In VPC-SC environments, external registry access is blocked — pods fail to start with ErrImagePull.
cicd_trigger_config.branch_pattern"^main$" for a single production branch; "^(main|develop)$" for GitFlowMediumWrong regex: trigger never fires (builds must be run manually). Overly broad regex (e.g. ".*"): every branch push triggers a build and deployment, including feature branches — possible accidental production deployments.
enable_cloud_deployfalse until a proper staging pipeline is neededMediumEnabling enable_cloud_deploy without enable_cicd_trigger: no automated trigger creates releases. Releases must be created manually via gcloud deploy releases create. The pipeline exists but is never fed automatically.
cloud_deploy_stagesDefault [dev, staging, prod] with require_approval = true on prodCriticalSetting require_approval = false on prod allows any successful staging build to automatically promote to production without human review. Combine with auto_promote = true and a broken build reaches production automatically.

Reliability & Scheduling

VariableSensible DefaultRiskConsequence of Incorrect Value
enable_pod_disruption_budgettrue (default; always keep enabled in production)HighSetting false: during GKE Autopilot node upgrades, Kubernetes may evict all pods simultaneously. If min_instance_count = 1, the single pod is evicted and the application is completely unavailable during the upgrade window (can last minutes).
pdb_min_available"1" for single-replica; "50%" for multi-replica (3+)HighSetting "1" with max_instance_count = 1: GKE cannot drain the node for upgrades because evicting the only pod would violate the budget. Node upgrade is indefinitely blocked — cluster falls behind on security patches. Set "0" for single-replica workloads where a brief upgrade disruption is acceptable.
enable_topology_spreadfalse initially; enable for production HA deployments with min_instance_count ≥ 3MediumEnabling with topology_spread_strict = true and min_instance_count < 3: pods cannot be scheduled across 3 zones with a max skew of 1. Pods remain Pending indefinitely. Check with kubectl describe pod for FailedScheduling events referencing topology constraints.
enable_resource_quotafalse for standalone deployments; true for shared multi-tenant clustersMediumEnabling with quota values lower than what the workload requests: pods fail to schedule with exceeded quota events. Initialization jobs and cron jobs also fail. Always set quota_cpu_limits and quota_memory_limits ≥ the sum of all containers' limits in the namespace.
backup_schedule"0 2 * * *" (daily at 02:00 UTC; adjust to a low-traffic window in your timezone)LowRunning backup during peak traffic increases DB load. Using "*/5 * * * *" (every 5 min) creates excessive backup files and storage costs.
backup_retention_days7 for dev; 30 for productionMedium1 or 0: almost no recovery window. A mistake is only reversible within 24 hours. 365+: storage costs grow unboundedly. Strike a balance based on your RPO requirements.

Integrations

VariableSensible DefaultRiskConsequence of Incorrect Value
enable_redisfalse if the application does not use Redis (avoid injecting unused env vars)MediumLeaving true for a non-Redis application: REDIS_HOST defaults to the NFS server IP. If the NFS VM has no Redis service, the application may log connection errors on startup. Set false explicitly for apps that do not use Redis.
redis_hostThe private IP of your Cloud Memorystore instance (e.g. "10.0.0.5")HighWrong IP: all Redis GET/SET operations fail. Session data is lost, cache misses spike, and any queue-backed async tasks stop processing. If the application falls back to database-backed caching, DB load increases significantly.
redis_port"6379" (default; only change if Memorystore is configured on a non-standard port)MediumWrong port: Redis client receives connection refused. Same impact as wrong redis_host.
redis_authThe Memorystore auth string (leave "" only for private-VPC dev instances)MediumSetting "" when the Redis instance has AUTH enabled: all connections are rejected with NOAUTH Authentication required. The application cannot use Redis at all. Setting a wrong value: same outcome.
enable_backup_importfalse (default; only set true on the apply where you want to restore)HighLeaving true after a successful import: the import job runs on every subsequent terraform apply, potentially overwriting live data with stale backup data. Set back to false immediately after a successful restore.
backup_format"sql" for plain SQL dumps; "auto" when file extension is reliableHighWrong format (e.g. "sql" for a gzipped dump): import fails with a parse error. The "auto" setting detects format from file extension and is safe for most cases.
configure_service_meshfalse (default); only enable when ASM is required for mTLS or traffic managementMediumEnabling on a project without the required Fleet/ASM APIs enabled: the apply fails with API not enabled errors. Enabling without existing ASM infrastructure in a shared cluster: Istio sidecar injection label is applied but no control plane serves it — pods start but sidecar containers remain in ContainerCreating.

Configuration Pitfalls & Sensible Defaults

Risk levels: Critical (data loss, full outage, security breach) — High (service unavailable or significant degradation) — Medium (degraded function or increased cost) — Low (minor impact).

VariableSensible DefaultRiskConsequence of Incorrect Value
application_nameShort, lowercase, hyphen-safe (e.g. "myapp", "payments-api")CriticalEmbedded in every GCP and Kubernetes resource name (namespace, services, secrets, SQL instance, GCS buckets). Never change after first deploy — Terraform destroys and recreates all named resources, causing complete data loss and a fresh empty database.
tenant_deployment_idMatch environment: "prod", "staging", "dev"CriticalEmbedded in all resource names alongside application_name. Never change after first deploy — same outcome as changing application_name: full resource recreation, data loss, and a new empty deployment running beside the orphaned old one.
deployment_id"" (let the platform auto-generate; never set manually)CriticalIf set manually and later changed, every resource whose name includes this suffix is destroyed and recreated. Auto-generated values are stable across applies.
quota_memory_requests / quota_memory_limitsBinary unit suffix required: "4Gi", "8192Mi" — never a bare integerCriticalA bare integer such as "4" is interpreted by Kubernetes as 4 bytes, not 4 gibibytes. The ResourceQuota is created successfully but every pod is immediately rejected at scheduling time with exceeded quota: requests.memory. All pod scheduling is blocked — the namespace is effectively dead until the quota is corrected and the ResourceQuota is updated.
stateful_pvc_enabled + workload_typestateful_pvc_enabled = true automatically resolves to StatefulSet; do not also set workload_type = "Deployment"Criticalstateful_pvc_enabled = true with workload_type = "Deployment" fails validation at plan time with an explicit error. Terraform refuses to apply. Setting only stateful_pvc_enabled = true (without workload_type) correctly resolves to StatefulSet.
container_portMust exactly match what the application server binds to (e.g. 8080, 3000, 5000)CriticalPort mismatch: Kubernetes liveness and readiness probes fail immediately on every pod. All pods enter a CrashLoopBackOff restart loop. The Deployment never reaches Ready. Service is offline.
prereq_gke_subnet_cidr"10.201.0.0/24" (default) — pin to the applied value and never change itCriticalChanging after the inline GKE cluster has been created: the existing subnet is destroyed and recreated with the new CIDR, which requires destroying and recreating the GKE cluster. All workloads on the cluster are lost. CIDR ranges must be unique across all GKE deployments in the project — overlapping CIDRs cause PSA/VPC peering failures.
prereq_subnet_cidr_override"" initially; pin to the output CIDR value from the first apply on all subsequent appliesCriticalLeft blank on subsequent applies when the inline VPC subnet already exists: Terraform may auto-assign a new CIDR, triggering subnet replacement which forces GKE cluster recreation and complete data loss. Always record and pin the CIDR from the first apply's output.
session_affinity"ClientIP" for stateful applications requiring sticky sessionsHigh"None" for a stateful app (e.g. one storing session data in memory): each request may be routed to a different pod. Users experience random session loss — logged out, lost shopping cart, broken multi-step wizards. Default is "ClientIP" which is correct for most stateful apps.
enable_iap + iap_oauth_client_id / iap_oauth_client_secretBoth iap_oauth_client_id and iap_oauth_client_secret must be pre-created OAuth 2.0 credentialsCriticalenable_iap = true without valid iap_oauth_client_id and iap_oauth_client_secret: the GCPBackendPolicy deployment fails. The Gateway backend configuration is broken and no traffic is routed. IAP on GKE requires pre-created OAuth credentials — unlike Cloud Run IAP, these cannot be auto-generated by the module.
enable_iap + iap_authorized_users / iap_authorized_groupsAlways populate at least one user or group before enabling IAPCriticalenable_iap = true with empty user and group lists: 100% of requests return HTTP 403. The service is deployed and reachable but no identity is granted access. Update IAM before enabling.
enable_cloud_armorfalse for internal/dev; true for production internet-facing servicesHighEnabling mid-deployment with admin_ip_ranges empty: no bypass rule is created. If your own office IP happens to trigger a WAF rule (e.g. during a penetration test), you are denied. In VPC-SC mode, empty admin_ip_ranges causes VPC-SC provisioning to be silently skipped.
binauthz_evaluation_mode"ALWAYS_ALLOW" until CI pipeline is producing valid attestations; then "REQUIRE_ATTESTATION"Critical"REQUIRE_ATTESTATION" before the CI/CD pipeline attests images: all pod deployments are blocked with Image is not attested in pod events. Emergency rollbacks are also blocked. "ALWAYS_DENY" blocks all image deployments unconditionally including hotfixes.
enable_vpc_sc + vpc_sc_dry_runAlways enable with vpc_sc_dry_run = true first; audit logs for violations before enforcingCriticalvpc_sc_dry_run = false on first enable without validating: any unrecognised principal or IP is blocked immediately. GKE node pulls from Artifact Registry, Cloud SQL Auth Proxy calls, and Secret Manager reads can all fail simultaneously if access levels are incomplete.
organization_idRequired for VPC-SC on folder-nested projects — auto-discovery is disabled for folder-nested projectsHighFolder-nested project without explicit organization_id: VPC-SC perimeter is silently skipped with a warning. No error is raised, but no security perimeter is created.
min_instance_count + max_instance_countminmax; min = 1 for production; ensure max × connections_per_pod ≤ Cloud SQL max_connectionsHighmin > max: HPA is invalid and the deployment cannot scale. max too high: exhausts Cloud SQL connection pool (FATAL: sorry, too many clients). min = 0 for stateful apps: scale-to-zero destroys in-memory state.
pdb_min_available + min_instance_count"1" for multi-replica; "0" for single-replica workloads where brief downtime is acceptableHighpdb_min_available = "1" with a single-replica deployment: GKE cannot drain the node for upgrades because evicting the only pod violates the budget. Node upgrades are permanently blocked — the cluster falls behind on security patches and eventually becomes unschedulable.
enable_resource_quota + quota_memory_* / quota_cpu_*Set quota values ≥ sum of all containers' requests in the namespaceHighQuota values lower than the workload's aggregate resource requests: pods fail to schedule with exceeded quota events. Initialization jobs and cron jobs also fail. The namespace is unusable until quotas are raised.
enable_cloudsql_volumetrue (inject the Cloud SQL Auth Proxy sidecar) — set false only for apps without Cloud SQLHighfalse for an app that uses Cloud SQL: the pod must reach the Cloud SQL private IP directly over TCP. If firewall rules do not permit this, all database connections fail at runtime. IAM-based proxy authentication is also lost.
cloud_sql_proxy_version"2-alpine" (current stable) — do not pin to a very old tagMediumVery old tag: known CVEs in the proxy binary. Non-existent tag: the mirroring job fails during apply, blocking the entire deployment. Default "2-alpine" always resolves to the latest stable 2.x release.
enable_auto_password_rotation + rotation_propagation_delay_secEnable only after validating the rotation pipeline; rotation_propagation_delay_sec90HighToo short a rotation_propagation_delay_sec: the old credential is revoked before all pods restart with the new version. The connection pool is exhausted and HTTP 500 errors appear until every pod completes a restart cycle.
secret_rotation_period"2592000s" (30 days) — must include the s suffixHighA value without the s suffix (e.g. "2592000") is rejected by Secret Manager at apply time. The secret rotation notification is not registered and automatic rotation never fires.
enable_backup_importfalse (default) — set true only on the one apply where you want to restoreCriticalLeaving enable_backup_import = true after a successful restore: the import Kubernetes Job re-runs on every terraform apply, overwriting the live database with the stale backup file. Set back to false immediately after a successful restore.
enable_topology_spread + topology_spread_stricttopology_spread_strict = false unless you have ≥ 3 replicas spread across 3 zonesMediumtopology_spread_strict = true with fewer than 3 replicas or fewer than 3 available zones: new pods cannot be scheduled (FailedScheduling — topology constraint cannot be satisfied). Existing pods remain running but any replacement pods are permanently Pending.
enable_network_segmentationfalse initially; enable only after mapping all inter-namespace traffic flowsHighEnabling without understanding traffic patterns: Kubernetes NetworkPolicies block legitimate pod-to-pod communication. Symptoms: database connections time out, initialization jobs hang, health probes fail. Debugging requires kubectl describe networkpolicy -n NAMESPACE.
workload_type"Deployment" for stateless apps; "StatefulSet" for apps requiring stable pod identity or persistent volumesHigh"Deployment" for a workload needing persistent volume identity: pods receive different PVCs on each restart, causing data inconsistency or loss. "StatefulSet" for a truly stateless app: adds unnecessary scheduling constraints and slower rolling update behaviour.
kubernetes_ready outputCheck this output before depending on Kubernetes resources in a pipelineHighOn the first apply of an inline cluster, kubernetes_ready = false — the Kubernetes API is not yet reachable and all Kubernetes resources are excluded from the plan. A second apply is required to deploy application resources. Skipping the second apply leaves infrastructure provisioned but no application deployed.
enable_image_mirroringtrue (default; strongly recommended)Highfalse: GKE nodes pull directly from external registries. Docker Hub rate limits cause ErrImagePull during peak deployments. In VPC-SC environments, external registry access is blocked — all pod starts fail.
gke_cluster_selection_mode + gke_cluster_name"primary" (auto-discover) unless you have multiple clusters; "explicit" only with a verified gke_cluster_nameHigh"explicit" with an invalid or wrong cluster name: Terraform targets the wrong cluster. All Kubernetes resources (namespace, deployments, secrets) are deployed to the wrong environment silently.
cloud_deploy_stages prod stagerequire_approval = true on the production stageCriticalrequire_approval = false with auto_promote = true on prod: a successful staging rollout automatically promotes to production without human review. A broken migration or bad image tag reaches production automatically.
configure_service_meshfalse until ASM is confirmed operational in the clusterMediumEnabling without required Fleet/ASM APIs (gkehub.googleapis.com, mesh.googleapis.com): apply fails with API not enabled. Enabling on a shared cluster where ASM control plane is not running: Istio sidecar injection label is applied but sidecar containers remain in ContainerCreating indefinitely — all pods appear degraded.