Ollama GKE Module — Configuration Guide

Ollama is an open-source LLM inference server that serves large language models such as Llama, Mistral, Gemma, and Phi via a REST API on port 11434. This module deploys Ollama on GKE Autopilot as a Kubernetes Deployment with model weights persisted to a GCS Fuse volume so that pod restarts load models from storage rather than re-downloading them.

Ollama GKE is a wrapper module built on top of App GKE. It delegates all GCP infrastructure provisioning to App GKE (GKE cluster, networking, GCS, Secret Manager, CI/CD) and uses an Ollama Common sub-module to supply Ollama-specific application configuration, the GCS models bucket, and the optional model-pull initialization job. The Ollama Common outputs feed into App GKE's application_config, module_storage_buckets, and scripts_dir inputs.

Ollama GKE is designed as a shared in-cluster AI inference endpoint. Any pod in the same cluster can call the API through the internal ClusterIP service URL http://ollama.<namespace>.svc.cluster.local:11434. For CPU-only, serverless inference, use Ollama CloudRun instead.
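As a rough illustration of pointing a consumer at this endpoint, the fragment below sets an environment variable on a separate, hypothetical client deployment. The variable name OLLAMA_BASE_URL and the namespace ai-inference are assumptions made for the example, not values defined by this module; substitute whatever your client reads and the namespace reported in the module outputs.

```hcl
# Hypothetical tfvars for a client application running in the same cluster.
# OLLAMA_BASE_URL and the "ai-inference" namespace are illustrative assumptions.
environment_variables = {
  OLLAMA_BASE_URL = "http://ollama.ai-inference.svc.cluster.local:11434"
}
```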


§1 · Module Overview

What Ollama GKE provides

  • An Ollama Kubernetes Deployment (prebuilt image ollama/ollama, mirrored to Artifact Registry when enable_image_mirroring = true) with a ClusterIP service on port 11434.
  • A GCS bucket (<resource_prefix>-models) mounted via GCS Fuse CSI driver at /mnt/gcs. OLLAMA_MODELS is set to /mnt/gcs/ollama/models so weights persist across pod restarts.
  • An optional model-pull Kubernetes Job that starts a local Ollama server in the background, pulls the specified model, and stores it in the GCS bucket. Runs when default_model is non-empty and initialization_jobs = [].
  • Horizontal Pod Autoscaler (HPA) between min_instance_count and max_instance_count.
  • No database, no Redis — Ollama is stateless beyond its GCS-backed model cache.

Key differences from App GKE defaults

| Feature | App GKE default | Ollama GKE default |
| --- | --- | --- |
| container_port | 8080 | 11434 (set by Ollama Common) |
| container_resources.cpu_limit | "1000m" | "8" |
| container_resources.memory_limit | "512Mi" | "16Gi" |
| container_resources.cpu_request | "500m" | "4" |
| container_resources.mem_request | "256Mi" | "8Gi" |
| min_instance_count | 1 | 1 |
| max_instance_count | 3 | 3 |
| service_type | "ClusterIP" | "ClusterIP" |
| enable_redis | varies | always false (hard-coded) |
| Database | varies | NONE (via database_type) |
| GCS models bucket | none | auto-provisioned via Ollama Common |
| Model-pull job | none | auto-generated when default_model is set |
| Auto-injected env vars | none | OLLAMA_MODELS, OLLAMA_HOST, OLLAMA_KEEP_ALIVE |

§2 · IAM & Project Identity

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| project_id | string | required | GCP project into which all resources are deployed. |
| tenant_deployment_id | string | "demo" | Short suffix appended to resource names. 1–20 lowercase letters, numbers, hyphens. |
| resource_creator_identity | string | "rad-module-creator@tec-rad-ui-2b65.iam.gserviceaccount.com" | Service account used by Terraform. |
| support_users | list(string) | [] | Email addresses granted IAM access and added as monitoring alert recipients. |
| resource_labels | map(string) | {} | Labels applied to all module-managed resources. |
| deployment_region | string | "us-central1" | GCP region fallback when network discovery cannot determine region from VPC subnets. Also used as the GCS bucket region. |
| module_description | string | (Ollama GKE description) | Platform UI description. |
| module_documentation | string | "https://docs.radmodules.dev/docs/modules/Ollama_GKE" | External documentation URL. |
| module_dependency | list(string) | ["Services GCP"] | Modules that must be deployed before this one. |
| module_services | list(string) | (GCP service list) | GCP services consumed by this module. |
| credit_cost | number | 150 | Platform credits consumed on deployment. |
| require_credit_purchases | bool | false | Enforce credit balance check before deployment. |
| enable_purge | bool | true | Permit full deletion of all module resources on destroy. |
| public_access | bool | true | Controls platform UI visibility. |
| deployment_id | string | "" | Optional fixed deployment ID. Auto-generated when blank. |

§3 · Core Service Configuration

§3.A · Application Identity (Group 2)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| application_name | string | "ollama" | Base name for Kubernetes resources and GCS bucket. Do not change after initial deployment. |
| application_display_name | string | "Ollama LLM Server" | Human-readable name shown in the platform UI. |
| application_description | string | "Ollama — standalone open-source LLM inference server on GKE..." | Brief description surfaced in Kubernetes annotations. |
| application_version | string | "latest" | Ollama Docker image tag. Use a pinned tag in production. |

§3.B · Ollama Model Configuration (Group 18)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| default_model | string | "" | Model to pull on first deployment. Examples: "llama3.2:3b", "mistral", "phi3:mini". Leave empty to skip the auto-pull job. |
| model_pull_timeout_seconds | number | 3600 | Timeout for the model-pull init job. Large models take 20–30 minutes on first pull. Valid range: 300–7200. |

When default_model is set and initialization_jobs is empty, a Kubernetes Job named model-pull is created automatically using the scripts/model-pull.sh script from Ollama Common. The job mounts the ollama-models GCS volume so pulled weights persist.
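A minimal sketch of the model-pull settings, assuming a download that may need more than the default hour. The model tag is only an example; 7200 is the upper bound of the documented valid range.

```hcl
# Pull a model on first deployment and give the Job the maximum allowed time.
default_model              = "mistral"
model_pull_timeout_seconds = 7200 # valid range: 300-7200

# Leave initialization_jobs empty so the auto-generated model-pull Job is created.
initialization_jobs = []
```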

§3.C · Runtime & Scaling (Group 3)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| deploy_application | bool | true | Set false to provision storage and IAM without deploying the Kubernetes workload. |
| workload_type | string | "Deployment" | Kubernetes workload type. "Deployment" is recommended for GCS-backed Ollama. Use "StatefulSet" only when local PVC storage is preferred over GCS. |
| container_resources | object | { cpu_limit="8", memory_limit="16Gi", cpu_request="4", mem_request="8Gi" } | Container CPU and memory configuration. For 3B models: cpu_limit="4", memory_limit="8Gi". For 7B models: cpu_limit="8", memory_limit="16Gi". |
| min_instance_count | number | 1 | Minimum pod replicas. 1 keeps a warm instance for low-latency inference. |
| max_instance_count | number | 3 | Maximum pod replicas for HPA. |
| timeout_seconds | number | 300 | Kubernetes pod termination grace period seconds. Increase for long inference requests. Valid range: 0–3600. |
| termination_grace_period_seconds | number | 60 | Seconds Kubernetes waits before force-killing the pod after a SIGTERM. |
| service_type | string | "ClusterIP" | Kubernetes Service type. "ClusterIP" keeps the API internal (recommended). "LoadBalancer" for external access only. Options: ClusterIP, LoadBalancer, NodePort. |
| session_affinity | string | "None" | Session affinity for the Kubernetes Service. "ClientIP" routes all requests from the same client IP to the same pod. |
| enable_image_mirroring | bool | true | Mirror ollama/ollama to Artifact Registry to avoid Docker Hub rate limits. |
| enable_vertical_pod_autoscaling | bool | false | Enable VPA to automatically adjust CPU/memory requests. Recommended for GKE Autopilot. |
| container_image_source | string | "prebuilt" | Image source: "prebuilt" uses ollama/ollama directly; "custom" triggers a Cloud Build. |
| container_image | string | "ollama/ollama" | Full container image URI when container_image_source = "prebuilt". |
| container_build_config | object | { enabled=false } | Cloud Build configuration when container_image_source = "custom". |
| container_protocol | string | "http1" | HTTP protocol version. |
| service_annotations | map(string) | {} | Custom annotations applied to the Kubernetes service. |
| service_labels | map(string) | {} | Custom labels applied to the Kubernetes service. |
| enable_cloudsql_volume | bool | false | Not needed for Ollama. Required by the App GKE interface. |
| cloudsql_volume_mount_path | string | "/cloudsql" | Cloud SQL Auth Proxy socket path. Present for interface compatibility. |
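The 3B sizing guidance for container_resources could be expressed as the fragment below. The limits come from the table; the request values are an assumption (half of each limit, mirroring the ratio of the module defaults), as is the lower replica ceiling.

```hcl
# Sizing for a 3B model, following the limits suggested for container_resources.
# Request values are an assumption: half of each limit, like the module defaults.
container_resources = {
  cpu_limit    = "4"
  memory_limit = "8Gi"
  cpu_request  = "2"
  mem_request  = "4Gi"
}

min_instance_count = 1 # keep one warm replica
max_instance_count = 2 # illustrative ceiling for the HPA
```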

§3.D · Automatically Injected Environment Variables

The following environment variables are set automatically by Ollama Common:

| Variable | Value | Purpose |
| --- | --- | --- |
| OLLAMA_MODELS | "/mnt/gcs/ollama/models" | Points Ollama at the GCS Fuse subdirectory for model persistence. |
| OLLAMA_HOST | "0.0.0.0:11434" | Binds to all interfaces so the Kubernetes service can forward traffic. |
| OLLAMA_KEEP_ALIVE | "24h" | Keeps loaded model resident in memory between requests. |

§3.E · Environment Variables & Secrets (Group 5)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| environment_variables | map(string) | {} | Additional plain-text env vars. The three Ollama vars above are injected automatically. |
| secret_environment_variables | map(string) | {} | Map of env var name → Secret Manager secret name. |
| secret_propagation_delay | number | 30 | Seconds to wait after secret creation. Valid range: 0–300. |
| secret_rotation_period | string | "2592000s" | Rotation notification period (30 days). Set null to disable. |
| enable_auto_password_rotation | bool | false | Not applicable for Ollama (no database). |
| rotation_propagation_delay_sec | number | 90 | Seconds to wait after rotation before restarting pods. |
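A short sketch of adding and overriding environment variables, assuming an always-warm deployment is wanted. Both variable names are standard Ollama settings mentioned elsewhere in this guide; the values are examples only.

```hcl
# Override the injected 24h keep-alive and add a concurrency setting.
# Do not set OLLAMA_MODELS here; it is managed by Ollama Common.
environment_variables = {
  OLLAMA_KEEP_ALIVE   = "-1" # keep the model loaded indefinitely
  OLLAMA_NUM_PARALLEL = "2"
}
```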

§3.F · Access & Networking (Group 4)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| enable_iap | bool | false | Enable Identity-Aware Proxy. |
| iap_authorized_users | list(string) | [] | Users granted access through IAP. |
| iap_authorized_groups | list(string) | [] | Google Groups granted access through IAP. |
| iap_oauth_client_id | string | "" | OAuth 2.0 client ID for IAP (required for GKE IAP). |
| iap_oauth_client_secret | string | "" | OAuth 2.0 client secret for IAP. Sensitive. |
| iap_support_email | string | "" | Support email shown on the IAP consent screen. |
| enable_cloud_armor | bool | false | Enable Cloud Armor WAF. |
| cloud_armor_policy_name | string | "" | Name of an existing Cloud Armor security policy. |
| application_domains | list(string) | [] | Custom domain names for the load balancer. |
| enable_custom_domain | bool | false | Configure a custom domain for the application. |
| enable_cdn | bool | false | Enable Cloud CDN on the load balancer. |
| reserve_static_ip | bool | false | Reserve a static external IP for the load balancer. |
| static_ip_name | string | "" | Name of the reserved static IP. Auto-generated when empty. |
| enable_vpc_sc | bool | false | Enable VPC Service Controls perimeter enforcement. |
| vpc_cidr_ranges | list(string) | [] | VPC subnet CIDR ranges for VPC-SC. |
| vpc_sc_dry_run | bool | true | Log VPC-SC violations without blocking. |
| organization_id | string | "" | GCP Organization ID for VPC-SC policy. |
| enable_audit_logging | bool | false | Enable detailed Cloud Audit Logs. |
| admin_ip_ranges | list(string) | [] | CIDR ranges for administrative access. |

§4 · GKE-Specific Configuration (Group 15)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| gke_cluster_name | string | "" | Name of an existing GKE Autopilot cluster. Uses the Services GCP-managed cluster when empty. |
| namespace_name | string | "" | Kubernetes namespace for the Ollama deployment. Auto-generated from the application name when empty. |
| enable_pod_disruption_budget | bool | true | Create a PodDisruptionBudget to maintain availability during node upgrades. |
| pdb_min_available | number | 1 | Minimum pods available during voluntary disruptions. |
| enable_network_segmentation | bool | false | Apply Kubernetes NetworkPolicies to restrict pod-to-pod traffic. |
| configure_service_mesh | bool | false | Configure Anthos Service Mesh (Istio) for traffic management and mTLS. |
| network_tags | list(string) | [] | GCP network tags applied to GKE nodes for firewall rule matching. |
| deployment_timeout | number | 600 | Seconds to wait for the Kubernetes Deployment to become ready. |
| enable_resource_quota | bool | false | Apply Kubernetes ResourceQuota to the namespace. |
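One way these settings might be combined for a shared cluster is sketched below. The namespace name is illustrative, and running two replicas is an assumption made so that the PodDisruptionBudget can still allow one pod to be evicted during node upgrades.

```hcl
# Fixed namespace so that client URLs stay stable across redeployments.
namespace_name = "ai-inference" # illustrative name

# Keep at least one pod available during voluntary disruptions.
enable_pod_disruption_budget = true
pdb_min_available            = 1
min_instance_count           = 2 # assumption: two replicas so one can be evicted
max_instance_count           = 3
```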

§5 · StatefulSet Settings (Group 16)

These settings apply only when workload_type = "StatefulSet". The default "Deployment" workload type with GCS Fuse is recommended, in which case none of them are needed.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| stateful_pvc_enabled | bool | false | Provision a PVC for local model storage. Not required when using GCS Fuse. |
| stateful_pvc_size | string | "50Gi" | PVC size (e.g. "100Gi" for multiple large models). |
| stateful_pvc_mount_path | string | "/mnt/data" | Container path for the PVC. |
| stateful_pvc_storage_class | string | "standard-rwo" | Kubernetes StorageClass for the PVC. |
| stateful_headless_service | bool | false | Create a headless service for the StatefulSet. |
| stateful_pod_management_policy | string | "OrderedReady" | Pod management: "OrderedReady" or "Parallel". |
| stateful_update_strategy | string | "RollingUpdate" | StatefulSet update strategy. |
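If local PVC storage is preferred, the switch might look like the sketch below. It only sets the variables documented in this section; how the injected OLLAMA_MODELS path interacts with the PVC mount is not covered here, so verify where weights are actually written before relying on it.

```hcl
# Switch from the GCS-backed Deployment to a StatefulSet with a local PVC.
workload_type              = "StatefulSet"
stateful_pvc_enabled       = true
stateful_pvc_size          = "100Gi" # room for multiple large models
stateful_pvc_mount_path    = "/mnt/data"
stateful_pvc_storage_class = "standard-rwo"
```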

§6 · Storage & Filesystem (Group 10)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| create_cloud_storage | bool | true | Provision GCS buckets in storage_buckets. The models bucket is always created. |
| storage_buckets | list(object) | [] | Additional GCS buckets beyond the Ollama models bucket. |
| enable_nfs | bool | false | Provision and mount Cloud Filestore NFS. Not required for Ollama. |
| nfs_mount_path | string | "/mnt/nfs" | Filesystem path for the NFS volume. |
| nfs_instance_name | string | "" | Name of an existing NFS GCE VM. Auto-discovered when empty. |
| nfs_instance_base_name | string | "app-nfs" | Base name for an inline NFS GCE VM. |
| gcs_volumes | list(object) | [] | Additional GCS buckets to mount as GCS Fuse volumes. The ollama-models bucket is always appended. |
| manage_storage_kms_iam | bool | false | Create a CMEK KMS key for GCS encryption. |
| enable_artifact_registry_cmek | bool | false | Enable CMEK encryption for Artifact Registry. |

GCS volume layout:

<resource_prefix>-models/          ← GCS bucket root (mounted at /mnt/gcs)
└── ollama/
    └── models/                    ← /mnt/gcs/ollama/models (OLLAMA_MODELS)

§7 · Backup & Maintenance (Group 6)

Ollama has no database — backup settings are present for App GKE interface compatibility only.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| backup_schedule | string | "" | Not applicable for Ollama. |
| backup_retention_days | number | 7 | Days to retain backup files. |
| enable_backup_import | bool | false | Not applicable for Ollama. |
| backup_source | string | "gcs" | "gcs" or "gdrive". |
| backup_uri | string | "" | Location of the backup file. |
| backup_format | string | "sql" | Format: sql, tar, gz, tgz, tar.gz, zip, auto. |

§8 · CI/CD Integration (Group 7)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| enable_cicd_trigger | bool | false | Create a Cloud Build trigger on GitHub pushes. |
| github_repository_url | string | "" | Full HTTPS URL of the GitHub repository. |
| github_token | string | "" | GitHub Personal Access Token. Sensitive. |
| github_app_installation_id | string | "" | Cloud Build GitHub App installation ID. |
| cicd_trigger_config | object | { branch_pattern = "^main$" } | Branch filter, trigger name, and build substitutions. |
| enable_cloud_deploy | bool | false | Switch to a Cloud Deploy pipeline. |
| cloud_deploy_stages | list(object) | [dev, staging, prod(approval)] | Ordered promotion stages. |
| enable_binary_authorization | bool | false | Enforce Binary Authorization for signed container images. |
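A hedged sketch of enabling the Cloud Build trigger follows. The repository URL and installation ID are placeholders, and only the branch_pattern field shown in the default is set on cicd_trigger_config; other fields of that object are not documented here.

```hcl
# Create a Cloud Build trigger for pushes to the main branch.
# The repository URL and installation ID below are placeholders.
enable_cicd_trigger        = true
github_repository_url      = "https://github.com/example-org/ollama-custom"
github_app_installation_id = "12345678"

cicd_trigger_config = {
  branch_pattern = "^main$"
}

# github_token is sensitive: prefer supplying it outside version control,
# for example through the TF_VAR_github_token environment variable.
```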

§9 · Custom Initialization & Jobs (Group 8)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| enable_custom_sql_scripts | bool | false | Not applicable for Ollama. |
| custom_sql_scripts_bucket | string | "" | GCS bucket containing SQL scripts. |
| custom_sql_scripts_path | string | "" | Path prefix within the bucket. |
| custom_sql_scripts_use_root | bool | false | Execute scripts as root database user. |
| initialization_jobs | list(object) | [] | Kubernetes Jobs executed once during deployment. When non-empty, overrides the auto-generated model-pull job. |
| cron_jobs | list(object) | [] | Recurring Kubernetes CronJobs. |
| additional_services | list(any) | [] | Additional containers deployed as separate Kubernetes Deployments with ClusterIP services (e.g. a Qdrant vector database alongside Ollama). |

§10 · Database Backend (Group 11)

Ollama has no database dependency. Redis is also disabled for this module.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| database_password_length | number | 32 | Not used. Present for interface compatibility. Valid range: 16–64. |
| enable_postgres_extensions | bool | false | Not applicable for Ollama. |
| postgres_extensions | list(string) | [] | Not applicable for Ollama. |
| enable_mysql_plugins | bool | false | Not applicable for Ollama. |
| mysql_plugins | list(string) | [] | Not applicable for Ollama. |
| database_type | string | "NONE" | Always "NONE" for Ollama; no Cloud SQL instance is provisioned. |

Hard-coded values (not user-configurable):

  • enable_redis = false

§11 · Observability & Health (Group 13)

Ollama's root endpoint (/) returns "Ollama is running" once the server is ready.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| startup_probe | object | { enabled=true, type="HTTP", path="/", initial_delay_seconds=30, timeout_seconds=5, period_seconds=15, failure_threshold=20 } | Startup probe forwarded through Ollama Common to the container spec. 20 attempts at 15 s intervals allow up to ~5 minutes (after the 30 s initial delay) for model loading from GCS. |
| liveness_probe | object | { enabled=true, type="HTTP", path="/", initial_delay_seconds=60, timeout_seconds=5, period_seconds=30, failure_threshold=3 } | Liveness probe. 60 s initial delay avoids false restarts during model loading. |
| startup_probe_config | object | { enabled=true } | Structured startup probe passed directly to App GKE (300 s timeout by default). |
| health_check_config | object | { enabled=true, initial_delay_seconds=60 } | Structured liveness probe passed directly to App GKE. |
| uptime_check_config | object | { enabled=true, path="/", check_interval="60s", timeout="10s" } | Cloud Monitoring uptime check. |
| alert_policies | list(object) | [] | Cloud Monitoring alert policies. |
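For larger models the startup window can be widened. The sketch below doubles failure_threshold and keeps every other field at its documented default; the target of roughly 10 minutes is an assumption chosen for illustration.

```hcl
# Allow roughly 10 minutes for startup instead of the default ~5.
# Budget after the initial delay = period_seconds * failure_threshold
#   default: 15 s * 20 = 300 s
#   here:    15 s * 40 = 600 s
startup_probe = {
  enabled               = true
  type                  = "HTTP"
  path                  = "/"
  initial_delay_seconds = 30
  timeout_seconds       = 5
  period_seconds        = 15
  failure_threshold     = 40
}
```

If the startup budget is raised, deployment_timeout may need to grow with it so that Terraform waits at least as long as the probe allows.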

§12 · Outputs

| Output | Description |
| --- | --- |
| service_name | Kubernetes service name. |
| service_url | Service URL (LoadBalancer or ClusterIP). |
| namespace | Kubernetes namespace containing the Ollama deployment. |
| ollama_cluster_url | Internal Kubernetes URL for the Ollama API: http://<service_name>.<namespace>.svc.cluster.local:11434. Other pods in the same cluster call this URL. |
| service_cluster_ip | ClusterIP of the Kubernetes service. |
| stage_service_cluster_ips | Map of ClusterIPs for stage-specific Kubernetes services (Cloud Deploy). |
| service_external_ip | External LoadBalancer IP when reserve_static_ip = true. |
| models_bucket | GCS bucket name where Ollama model weights are persisted. |
| storage_buckets | All provisioned GCS buckets. |
| network_name | VPC network name. |
| network_exists | Whether the VPC network exists. |
| regions | Available regions in the VPC. |
| container_image | Container image URI. |
| container_registry | Artifact Registry repository name. |
| deployment_id | Unique deployment identifier. |
| tenant_id | Tenant identifier. |
| resource_prefix | Resource naming prefix. |
| project_id | GCP project ID. |
| project_number | GCP project number. |
| monitoring_enabled | Whether Cloud Monitoring is configured. |
| monitoring_notification_channels | Monitoring notification channel names. |
| uptime_check_names | Uptime check names (returns [] for GKE). |
| initialization_jobs | Created initialization job names. |
| cron_jobs | Created cron job names. |
| statefulset_name | StatefulSet name when workload_type = "StatefulSet". |
| nfs_server_ip | NFS server internal IP (sensitive). |
| nfs_mount_path | NFS mount path in containers. |
| nfs_share_path | NFS share path on server. |
| nfs_setup_job | NFS setup job name. |
| db_import_job | Database import job name. |
| deployment_summary | Summary of the deployment. |
| cicd_enabled | Whether CI/CD pipeline is enabled. |
| github_repository_url | Connected GitHub repository URL. |
| github_repository_owner | GitHub repository owner. |
| github_repository_name | GitHub repository name. |
| artifact_registry_repository | Artifact Registry repository. |
| cloudbuild_trigger_name | Cloud Build trigger name. |
| cloudbuild_trigger_id | Cloud Build trigger ID. |
| cicd_configuration | CI/CD pipeline configuration details. |
| kubernetes_ready | true when the GKE cluster endpoint is available and all workload resources have been deployed. false on first apply of a new cluster. |
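When the module is called from a Terraform root module, its outputs could be re-exported as sketched below. The module source path and the output names ollama_endpoint and ollama_models_bucket are assumptions for the example; only ollama_cluster_url and models_bucket come from the table above.

```hcl
# Hypothetical root module; the source path is an assumption.
module "ollama" {
  source = "./modules/Ollama_GKE"

  project_id    = "my-gcp-project-id"
  default_model = "llama3.2:3b"
}

# Re-export the internal endpoint and bucket name for other stacks to consume.
output "ollama_endpoint" {
  value = module.ollama.ollama_cluster_url
}

output "ollama_models_bucket" {
  value = module.ollama.models_bucket
}
```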

§13 · Platform-Managed Behaviours

| Behaviour | Detail |
| --- | --- |
| OLLAMA_MODELS injected | Set to /mnt/gcs/ollama/models. Do not set this in environment_variables. |
| OLLAMA_HOST injected | Set to "0.0.0.0:11434" for Kubernetes service forwarding. |
| OLLAMA_KEEP_ALIVE injected | Set to "24h". Override by setting OLLAMA_KEEP_ALIVE in environment_variables. |
| Models bucket always provisioned | The <resource_prefix>-models GCS bucket is always created, regardless of create_cloud_storage or storage_buckets settings. |
| GCS volume always mounted | The ollama-models volume is always appended to gcs_volumes inside Ollama Common. |
| No database, no Redis | enable_redis = false and database_type = "NONE" are hard-coded. |
| Model-pull job auto-generated | When default_model is set and initialization_jobs = [], a Kubernetes Job (model-pull) is created using scripts/model-pull.sh. Providing any entry in initialization_jobs disables it. |
| Network discovery | The module uses the App_Common/modules/app_networking module to discover the VPC region from existing subnets. The first discovered region is used as deployment_region. Falls back to var.deployment_region when no subnets are found. |
| Namespace auto-generated | When namespace_name = "", the namespace defaults to <resource_prefix> (the full app<name><tenant><id> string). |
| scripts_dir | Set to Ollama Common's bundled scripts/ directory. |

§14 · Variable Reference

Complete variable reference with UIMeta group assignments.

| Variable | Default | Group |
| --- | --- | --- |
| module_description | (Ollama GKE description) | 0 |
| module_documentation | "https://docs.radmodules.dev/docs/modules/Ollama_GKE" | 0 |
| module_dependency | ["Services GCP"] | 0 |
| module_services | (list of GCP services) | 0 |
| credit_cost | 150 | 0 |
| require_credit_purchases | false | 0 |
| enable_purge | true | 0 |
| public_access | true | 0 |
| deployment_id | "" | 0 |
| resource_creator_identity | "rad-module-creator@..." | 0 |
| project_id | (required) | 1 |
| tenant_deployment_id | "demo" | 1 |
| support_users | [] | 1 |
| resource_labels | {} | 1 |
| deployment_region | "us-central1" | 1 |
| application_name | "ollama" | 2 |
| application_display_name | "Ollama LLM Server" | 2 |
| application_description | "Ollama — standalone open-source LLM inference server on GKE..." | 2 |
| application_version | "latest" | 2 |
| deploy_application | true | 3 |
| workload_type | "Deployment" | 3 |
| container_resources | { cpu_limit="8", memory_limit="16Gi", cpu_request="4", mem_request="8Gi" } | 3 |
| min_instance_count | 1 | 3 |
| max_instance_count | 3 | 3 |
| timeout_seconds | 300 | 3 |
| termination_grace_period_seconds | 60 | 3 |
| service_type | "ClusterIP" | 3 |
| session_affinity | "None" | 3 |
| enable_image_mirroring | true | 3 |
| enable_vertical_pod_autoscaling | false | 3 |
| container_image_source | "prebuilt" | 3 |
| container_image | "ollama/ollama" | 3 |
| container_build_config | { enabled=false } | 3 |
| container_protocol | "http1" | 3 |
| service_annotations | {} | 3 |
| service_labels | {} | 3 |
| enable_cloudsql_volume | false | 3 |
| cloudsql_volume_mount_path | "/cloudsql" | 3 |
| enable_iap | false | 4 |
| iap_authorized_users | [] | 4 |
| iap_authorized_groups | [] | 4 |
| iap_oauth_client_id | "" | 4 |
| iap_oauth_client_secret | "" | 4 |
| iap_support_email | "" | 4 |
| enable_cloud_armor | false | 4 |
| cloud_armor_policy_name | "" | 4 |
| application_domains | [] | 4 |
| enable_custom_domain | false | 4 |
| enable_cdn | false | 4 |
| reserve_static_ip | false | 4 |
| static_ip_name | "" | 4 |
| enable_vpc_sc | false | 4 |
| vpc_cidr_ranges | [] | 4 |
| vpc_sc_dry_run | true | 4 |
| organization_id | "" | 4 |
| enable_audit_logging | false | 4 |
| admin_ip_ranges | [] | 4 |
| environment_variables | {} | 5 |
| secret_environment_variables | {} | 5 |
| secret_propagation_delay | 30 | 5 |
| secret_rotation_period | "2592000s" | 5 |
| enable_auto_password_rotation | false | 5 |
| rotation_propagation_delay_sec | 90 | 5 |
| backup_schedule | "" | 6 |
| backup_retention_days | 7 | 6 |
| enable_backup_import | false | 6 |
| backup_source | "gcs" | 6 |
| backup_uri | "" | 6 |
| backup_format | "sql" | 6 |
| enable_cicd_trigger | false | 7 |
| github_repository_url | "" | 7 |
| github_token | "" | 7 |
| github_app_installation_id | "" | 7 |
| cicd_trigger_config | { branch_pattern = "^main$" } | 7 |
| enable_cloud_deploy | false | 7 |
| cloud_deploy_stages | [dev, staging, prod(approval)] | 7 |
| enable_binary_authorization | false | 7 |
| enable_custom_sql_scripts | false | 8 |
| custom_sql_scripts_bucket | "" | 8 |
| custom_sql_scripts_path | "" | 8 |
| custom_sql_scripts_use_root | false | 8 |
| initialization_jobs | [] | 8 |
| cron_jobs | [] | 8 |
| additional_services | [] | 8 |
| create_cloud_storage | true | 10 |
| storage_buckets | [] | 10 |
| enable_nfs | false | 10 |
| nfs_mount_path | "/mnt/nfs" | 10 |
| nfs_instance_name | "" | 10 |
| nfs_instance_base_name | "app-nfs" | 10 |
| gcs_volumes | [] | 10 |
| manage_storage_kms_iam | false | 10 |
| enable_artifact_registry_cmek | false | 10 |
| database_password_length | 32 | 11 |
| enable_postgres_extensions | false | 11 |
| postgres_extensions | [] | 11 |
| enable_mysql_plugins | false | 11 |
| mysql_plugins | [] | 11 |
| database_type | "NONE" | 11 |
| startup_probe_config | { enabled=true } | 13 |
| health_check_config | { enabled=true, initial_delay_seconds=60 } | 13 |
| uptime_check_config | { enabled=true, path="/" } | 13 |
| alert_policies | [] | 13 |
| startup_probe | { path="/", initial_delay_seconds=30, failure_threshold=20 } | 13 |
| liveness_probe | { path="/", initial_delay_seconds=60, failure_threshold=3 } | 13 |
| gke_cluster_name | "" | 15 |
| namespace_name | "" | 15 |
| enable_pod_disruption_budget | true | 15 |
| pdb_min_available | 1 | 15 |
| enable_network_segmentation | false | 15 |
| configure_service_mesh | false | 15 |
| network_tags | [] | 15 |
| deployment_timeout | 600 | 15 |
| enable_resource_quota | false | 15 |
| stateful_pvc_enabled | false | 16 |
| stateful_pvc_size | "50Gi" | 16 |
| stateful_pvc_mount_path | "/mnt/data" | 16 |
| stateful_pvc_storage_class | "standard-rwo" | 16 |
| stateful_headless_service | false | 16 |
| stateful_pod_management_policy | "OrderedReady" | 16 |
| stateful_update_strategy | "RollingUpdate" | 16 |
| default_model | "" | 18 |
| model_pull_timeout_seconds | 3600 | 18 |

§15 · Configuration Examples

Basic Deployment

CPU-only inference for 3B models. Suitable for evaluation and shared internal cluster use.

# config/basic.tfvars
resource_creator_identity = ""
project_id = "my-gcp-project-id"
tenant_deployment_id = "demo"
application_name = "ollama"

container_resources = {
  cpu_limit    = "8"
  memory_limit = "16Gi"
  cpu_request  = "4"
  mem_request  = "8Gi"
}

min_instance_count = 1
max_instance_count = 3

service_type = "ClusterIP"

default_model = "llama3.2:3b"

Advanced Deployment

Production inference endpoint for 7B models with pod disruption budget, monitoring, and environment tuning.

# config/advanced.tfvars
resource_creator_identity = ""
project_id = "my-gcp-project-id"
tenant_deployment_id = "prod"
application_name = "ollama"
application_display_name = "Ollama LLM Server"
application_version = "latest" # replace with a pinned image tag in production

container_resources = {
  cpu_limit    = "8"
  memory_limit = "16Gi"
  cpu_request  = "6"
  mem_request  = "12Gi"
}

min_instance_count = 1
max_instance_count = 5

workload_type = "Deployment"
service_type = "ClusterIP"

default_model = "mistral"
model_pull_timeout_seconds = 3600

environment_variables = {
  OLLAMA_NUM_PARALLEL = "2"
  OLLAMA_KEEP_ALIVE   = "24h"
}

enable_pod_disruption_budget = true
pdb_min_available = 1

support_users = ["ops@example.com"]
resource_labels = {
  env     = "production"
  team    = "ai-platform"
  service = "ollama"
}

enable_image_mirroring = true

Configuration Pitfalls & Sensible Defaults

Risk levels: Critical (data loss, full outage, security breach) — High (service unavailable or significant degradation) — Medium (degraded function or increased cost) — Low (minor impact).

| Variable | Sensible Default | Risk | Consequence of Incorrect Value |
| --- | --- | --- | --- |
| container_resources.memory_limit | 16Gi (7B) / 8Gi (3B) | Critical | Insufficient memory causes OOM-kill mid-inference and crash-loops the pod. GKE Autopilot will not schedule the pod if the requested memory exceeds node capacity. Allocate at least 2× the quantised model weight size. |
| container_resources.cpu_limit | 8 (7B) / 4 (3B) | High | Too few CPUs cause multi-second token latency on large models. For production throughput on 7B models, 6–8 cores are recommended. |
| container_resources (GPU) | CPU-only defaults | High | The module documentation notes NVIDIA L4 GPU support. To enable GPU inference, you must provision an NVIDIA L4 node pool in the GKE Autopilot cluster (outside this module) and add nvidia.com/gpu: 1 as a resource request via container_resources. Without this, GPU acceleration is silently not used and inference runs on CPU. |
| min_instance_count | 1 | High | Setting to 0 allows scale-to-zero but causes 60–120 s cold starts (GCS Fuse mount + model load). Inappropriate for low-latency inference workloads. |
| model_pull_timeout_seconds | 3600 | High | A short timeout (e.g., 300) causes the model-pull Kubernetes Job to fail before the download completes for models larger than ~2 GB. The service starts but no default model is loaded. |
| default_model | "" (skip pull) | Medium | If no model is pulled at deploy time, the Ollama API returns an error on all inference requests until a model is manually pulled. |
| workload_type | "Deployment" | Medium | Using "StatefulSet" without enabling stateful_pvc_enabled results in pods with no persistent local storage. Conversely, setting stateful_pvc_enabled = true while specifying workload_type = "Deployment" fails at plan time. |
| service_type | "ClusterIP" | Critical | Setting service_type = "LoadBalancer" exposes the Ollama API (port 11434) publicly without authentication. Ollama has no built-in auth. Always keep ClusterIP for internal cluster-only access. |
| environment_variables.OLLAMA_ORIGINS | "*" (Ollama default) | High | If not explicitly restricted, CORS accepts any origin. Set to the specific UI origins (e.g., "http://openwebui.namespace.svc.cluster.local:8080") to prevent cross-cluster or external browser access to the API. |
| environment_variables.OLLAMA_KEEP_ALIVE | "24h" (module-injected; Ollama's own default is "5m") | Medium | With a short value such as Ollama's native "5m", models are evicted from GPU/CPU memory after 5 minutes idle and subsequent requests trigger a full reload (30–60 s). Keep "24h" or set "-1" for always-warm deployments. |
| environment_variables.OLLAMA_NUM_PARALLEL | 1 | Medium | Serialises all inference requests. For shared cluster deployments with multiple callers (OpenWebUI, Flowise, N8N), increase to 2–4 based on available CPU/memory. |
| stateful_pvc_enabled | false | Medium | Using GCS Fuse (default) means model weight loading at startup incurs network latency. A local PVC (StatefulSet) eliminates this but prevents pod migration during node pool upgrades. |
| quota_memory_requests / quota_memory_limits | "16Gi" / "32Gi" | Critical | Must use binary unit suffixes (Gi, Mi). A bare integer (e.g., "4") is treated as bytes by Kubernetes and will block all pod scheduling in the namespace. |
| enable_pod_disruption_budget | true | Medium | With pdb_min_available = 1 and only one running replica, rolling node upgrades stall because Kubernetes cannot evict the single pod. Run at least two replicas (min_instance_count ≥ 2) when using a PDB with pdb_min_available = 1. |
| deployment_timeout | 600 | High | The default 600 s wait for the Deployment to become ready may be insufficient when loading a large model (13B+) from GCS for the first time. Increase to 1200 for models over 8 GB. |
| session_affinity | "None" | Low | Setting to "ClientIP" routes all requests from the same caller to the same pod, which improves context continuity for multi-turn conversations but distributes load unevenly across replicas. |
| enable_resource_quota | false | Medium | Without a ResourceQuota, a misconfigured or runaway Ollama pod can consume all cluster resources. Enable and tune quotas in shared clusters. |
| enable_image_mirroring | true | Medium | Disabling it pulls the image directly from Docker Hub, which is rate-limited. Keep true in production. |
| gcs_volumes mount options | implicit-dirs | Medium | Without implicit-dirs in GCS Fuse mount options, directory listings fail and Ollama cannot discover models stored in GCS subdirectories. |
| max_instance_count | 3 | High | Each pod independently loads the full model into memory. For a 7B model (16 GiB each), three replicas require 48 GiB of cluster memory. Size the node pool accordingly or cap at 1 for large models. |
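Several of the mitigations above can be combined into one fragment. This is a sketch under stated assumptions rather than a recommended baseline: the origin URL is the illustrative one from the table, the quota variable names are taken from the table row above, and the values should be sized for your own models.

```hcl
# Internal-only service with restricted CORS and namespace quotas.
service_type = "ClusterIP"

environment_variables = {
  # Illustrative origin; replace with the UI that actually calls Ollama.
  OLLAMA_ORIGINS      = "http://openwebui.namespace.svc.cluster.local:8080"
  OLLAMA_NUM_PARALLEL = "2"
}

enable_resource_quota = true
quota_memory_requests = "16Gi" # always use binary suffixes (Gi, Mi)
quota_memory_limits   = "32Gi"

# Three replicas of a 7B model at 16 GiB each need 48 GiB of cluster memory.
max_instance_count = 3
deployment_timeout = 1200 # suggested for models over 8 GB
```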