Ollama CloudRun Module — Configuration Guide

Ollama is an open-source LLM inference server that serves large language models such as Llama, Mistral, Gemma, and Phi via a REST API on port 11434. This module deploys Ollama on Google Cloud Run (serverless, CPU-only) with model weights persisted to a GCS Fuse volume so that container restarts load models from storage rather than re-downloading them.

Ollama CloudRun is a wrapper module built on top of App CloudRun. It delegates all GCP infrastructure provisioning to App CloudRun (Cloud Run service, networking, Secret Manager, GCS, CI/CD) and uses an Ollama Common sub-module to supply Ollama-specific application configuration, the GCS models bucket, and the optional model-pull initialization job. The Ollama Common outputs feed into App CloudRun's application_config, module_storage_buckets, and scripts_dir inputs.

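This module is designed as a shared AI inference endpoint. Any workload in the same VPC can call the service at its Cloud Run URL (the service_url output, served over HTTPS). Port 11434 is the container port that Cloud Run forwards requests to, not a port that callers connect to. For GPU-accelerated inference, use Ollama_GKE with an NVIDIA L4 node pool.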


§1 · Module Overview

What Ollama CloudRun provides

  • An Ollama container (prebuilt image ollama/ollama from Docker Hub, enable_image_mirroring = true by default) deployed on Cloud Run listening on port 11434.
  • A GCS bucket (<resource_prefix>-models) mounted via GCS Fuse at /mnt/gcs. The environment variable OLLAMA_MODELS is set to /mnt/gcs/ollama/models so model weights survive container restarts, and new revisions read them from GCS instead of re-downloading them.
  • An optional model-pull initialization job (Cloud Run Job) that starts a local Ollama server in the background, pulls the specified model, and stores it in the GCS bucket. This job only runs when default_model is non-empty and initialization_jobs = [].
  • No database, no Redis, no NFS — Ollama is stateless beyond its GCS-backed model cache.

Key differences from App CloudRun defaults

| Feature | App CloudRun default | Ollama CloudRun default |
| --- | --- | --- |
| container_port | 8080 | 11434 |
| cpu_limit | "1000m" | "4000m" |
| memory_limit | "512Mi" | "8Gi" |
| min_instance_count | 0 | 1 |
| max_instance_count | 1 | 1 |
| ingress_settings | "all" | "internal" |
| execution_environment | "gen1" | "gen2" (required for GCS Fuse) |
| timeout_seconds | 60 | 3600 |
| enable_redis | varies | always false (hard-coded) |
| Database | varies | NONE (hard-coded, no Cloud SQL) |
| GCS models bucket | none | auto-provisioned via Ollama Common |
| Model-pull job | none | auto-generated when default_model is set |
| Auto-injected env vars | none | OLLAMA_MODELS, OLLAMA_HOST, OLLAMA_KEEP_ALIVE |

§2 · IAM & Project Identity

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| project_id | string | required | GCP project into which all resources are deployed. |
| tenant_deployment_id | string | "demo" | Short suffix appended to resource names. 1–20 lowercase letters, numbers, hyphens. |
| resource_creator_identity | string | "rad-module-creator@tec-rad-ui-2b65.iam.gserviceaccount.com" | Service account used by Terraform. |
| support_users | list(string) | [] | Email addresses granted IAM access and added to monitoring alert channels. |
| resource_labels | map(string) | {} | Labels applied to all module-managed resources. |
| module_description | string | (Ollama CloudRun description) | Platform UI description. |
| module_documentation | string | "https://docs.radmodules.dev/docs/modules/Ollama_CloudRun" | External documentation URL. |
| module_dependency | list(string) | ["Services GCP"] | Modules that must be deployed before this one. |
| module_services | list(string) | (GCP service list) | GCP services consumed by this module. |
| credit_cost | number | 50 | Platform credits consumed on deployment. |
| require_credit_purchases | bool | false | Enforce credit balance check before deployment. |
| enable_purge | bool | true | Permit full deletion of all module resources on destroy. |
| public_access | bool | true | Controls platform UI visibility. |
| deployment_id | string | "" | Optional fixed deployment ID. Auto-generated when blank. |

§3 · Core Service Configuration

§3.A · Application Identity

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| application_name | string | "ollama" | Base name for the Cloud Run service, Artifact Registry repo, and GCS bucket. Do not change after initial deployment. |
| application_display_name | string | "Ollama LLM Server" | Human-readable name shown in the platform UI. |
| description | string | "Ollama — standalone open-source LLM inference server..." | Brief description surfaced in resource metadata. |
| application_version | string | "latest" | Ollama Docker image tag. Use a pinned tag (e.g. "0.3.12") in production. |

§3.B · Ollama Model Configuration (Group 18)

These are the Ollama-specific variables that have no equivalent in other wrapper modules.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| default_model | string | "" | Model to pull on first deployment. Examples: "llama3.2:3b" (~2 GB), "mistral" (~4 GB), "llama3:8b" (~5 GB). Leave empty to skip the auto-pull job. Stored in GCS and loaded on every startup. |
| model_pull_timeout_seconds | number | 3600 | Timeout in seconds for the model-pull initialization job. Large models can take 10–30 minutes on first pull. Valid range: 300–7200. |

When default_model is set and initialization_jobs is empty, the module automatically creates a Cloud Run Job named model-pull that:

  1. Starts a local Ollama server in the background.
  2. Polls http://localhost:11434/ until ready (up to 30 retries, 3 seconds apart).
  3. Runs ollama pull $OLLAMA_MODEL.
  4. Shuts down the server.

The job mounts the ollama-models GCS volume so the pulled weights persist into the shared models bucket.
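
As a minimal sketch, the auto-generated job can be enabled with a tfvars fragment like the one below. The model name is taken from the examples in the table above; the 1800 s timeout is an illustrative value inside the documented 300–7200 range, not a recommendation.

```hcl
# Sketch only: enable the auto-generated model-pull job.
default_model              = "llama3.2:3b" # ~2 GB per the table above
model_pull_timeout_seconds = 1800          # illustrative; valid range is 300-7200
initialization_jobs        = []            # any entry here replaces the auto-generated job
```

Leaving initialization_jobs empty matters here: per the rule above, a single custom entry disables the model-pull job entirely.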

§3.C · Runtime & Scaling (Group 3)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| deploy_application | bool | true | Set false to provision storage and IAM without deploying the Cloud Run service. |
| cpu_limit | string | "4000m" | CPU limit per container. 3B models: "4000m"; 7B models: "8000m". Cloud Run max is "8". |
| memory_limit | string | "8Gi" | Memory limit per container. 3B models: "8Gi"; 7B models: "16Gi". |
| min_instance_count | number | 1 | Minimum instances. 1 keeps a warm instance to avoid 60–120 s cold-start model loading. 0 enables scale-to-zero at the cost of latency. |
| max_instance_count | number | 1 | Maximum concurrent instances. LLM inference is CPU-saturating; multiple instances rarely help unless requests are fully independent. |
| execution_environment | string | "gen2" | Cloud Run execution generation. "gen2" is required for GCS Fuse support. |
| timeout_seconds | number | 3600 | Maximum request duration. Inference on large prompts can be slow; 3600 s is the maximum. Valid range: 0–3600. |
| container_protocol | string | "http1" | HTTP protocol. Use "h2c" only if all callers support HTTP/2 cleartext. |
| traffic_split | list(object) | [] | Traffic allocation across Cloud Run revisions. Empty sends all traffic to the latest revision. All entries must sum to 100. |
| enable_image_mirroring | bool | true | Mirror ollama/ollama to Artifact Registry before deployment to avoid Docker Hub rate limits. |
| service_annotations | map(string) | {} | Custom annotations applied to the Cloud Run service. |
| service_labels | map(string) | {} | Custom labels applied to the Cloud Run service. |
| cloudsql_volume_mount_path | string | "/cloudsql" | Required by the App CloudRun interface; not used by Ollama (no database). |
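
To illustrate the scaling trade-off described in the table, here is a hedged sketch of a development profile that accepts cold starts in exchange for lower idle cost. It uses only variables documented above; the specific combination is an assumption about a typical dev setup, not a tested recommendation.

```hcl
# Sketch only: development profile for a 3B model with scale-to-zero.
cpu_limit          = "4000m"
memory_limit       = "8Gi"
min_instance_count = 0    # scale to zero; expect a 60-120 s cold start while the model loads
max_instance_count = 1    # a single instance, matching the module default
timeout_seconds    = 3600 # keep the maximum so slow generations are not cut off
```

For interactive or synchronous callers, keeping min_instance_count = 1 (the module default) avoids the cold-start delay.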

§3.D · Automatically Injected Environment Variables

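The following environment variables are set automatically by Ollama Common. OLLAMA_MODELS and OLLAMA_HOST must not be overridden in environment_variables; OLLAMA_KEEP_ALIVE can be overridden there when a different keep-alive is needed (see §11):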

| Variable | Value | Purpose |
| --- | --- | --- |
| OLLAMA_MODELS | "/mnt/gcs/ollama/models" | Points Ollama at the GCS Fuse subdirectory for model persistence. |
| OLLAMA_HOST | "0.0.0.0:11434" | Binds to all interfaces so Cloud Run can forward traffic. |
| OLLAMA_KEEP_ALIVE | "24h" | Keeps loaded model in memory between requests to reduce per-request latency. |

Additional variables can be passed via environment_variables (e.g. OLLAMA_NUM_PARALLEL to allow concurrent inferences on multi-CPU instances).
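
A minimal sketch of passing an extra variable, assuming a value of 2 is acceptable for the instance's memory budget (the right number depends on model size and is not specified by this guide):

```hcl
# Sketch only: add tuning variables without touching the injected ones.
# Values must be strings because environment_variables is map(string).
environment_variables = {
  OLLAMA_NUM_PARALLEL = "2" # illustrative; each parallel slot needs additional memory
}
```

OLLAMA_MODELS and OLLAMA_HOST are deliberately absent from the map, since the module injects them.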

§3.E · Environment Variables & Secrets (Group 5)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| environment_variables | map(string) | {} | Additional plain-text env vars. The three Ollama-specific vars above are injected automatically. |
| secret_environment_variables | map(string) | {} | Map of env var name → Secret Manager secret name, injected at runtime. |
| secret_propagation_delay | number | 30 | Seconds to wait after secret creation before proceeding. Valid range: 0–300. |
| secret_rotation_period | string | "2592000s" | Pub/Sub rotation notification period (30 days). Set to null to disable. Format: "<seconds>s". |
| enable_auto_password_rotation | bool | false | Not applicable for Ollama (no database). |
| rotation_propagation_delay_sec | number | 90 | Seconds to wait after rotation before restarting the service. |

§3.F · Access & Networking (Group 4)

The default ingress_settings = "internal" is intentional — the Ollama API is designed to be called from within the same VPC by other applications (Flowise, N8N, RAGFlow, Django) rather than from the public internet.

| Variable | Type | Default | Options | Description |
| --- | --- | --- | --- | --- |
| ingress_settings | string | "internal" | all / internal / internal-and-cloud-load-balancing | "internal" restricts access to the VPC. Use "all" only if external callers need direct API access. |
| vpc_egress_setting | string | "PRIVATE_RANGES_ONLY" | ALL_TRAFFIC / PRIVATE_RANGES_ONLY | Routes only RFC 1918 outbound traffic via VPC. |
| enable_iap | bool | false | | Enable Identity-Aware Proxy. |
| iap_authorized_users | list(string) | [] | | Users granted access through IAP. |
| iap_authorized_groups | list(string) | [] | | Google Groups granted access through IAP. |
| enable_vpc_sc | bool | false | | Enable VPC Service Controls perimeter enforcement. |
| vpc_cidr_ranges | list(string) | [] | | VPC subnet CIDR ranges for the VPC-SC network access level. |
| vpc_sc_dry_run | bool | true | | Log VPC-SC violations without blocking. |
| organization_id | string | "" | | GCP Organization ID for VPC-SC Access Context Manager. |
| enable_audit_logging | bool | false | | Enable detailed Cloud Audit Logs. |
| admin_ip_ranges | list(string) | [] | | CIDR ranges for administrative access. |
| enable_cloud_armor | bool | false | | Enable Cloud Armor WAF fronted by a Global HTTPS Load Balancer. |
| application_domains | list(string) | [] | | Custom domain names for the Cloud Armor load balancer. |
| enable_cdn | bool | false | | Enable Cloud CDN on the Global HTTPS Load Balancer. |
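
As a hedged sketch, identity-based access could be layered on with the IAP variables from the table. The addresses are hypothetical, and the entry format (plain email versus an IAM member string with a prefix) is an assumption, since this guide does not state which one the module expects.

```hcl
# Sketch only: enable IAP for a named user and group.
# Entry format is assumed to be a plain email address; verify against the module's variable docs.
enable_iap            = true
iap_authorized_users  = ["alice@example.com"]       # hypothetical user
iap_authorized_groups = ["ai-platform@example.com"] # hypothetical Google Group
```

ingress_settings is left at its "internal" default here; which ingress mode IAP requires in this module is not covered by this guide.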

§4 · Storage & Filesystem (Group 10)

The Ollama models bucket is always provisioned automatically — no user configuration is required to enable GCS model persistence.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| create_cloud_storage | bool | true | Provision GCS buckets defined in storage_buckets. The models bucket is always created regardless. |
| storage_buckets | list(object) | [] | Additional GCS buckets to provision beyond the Ollama models bucket. |
| enable_nfs | bool | false | Provision and mount a Cloud Filestore NFS volume. Not required for Ollama (uses GCS). |
| nfs_mount_path | string | "/mnt/nfs" | Filesystem path for the NFS volume. |
| nfs_instance_name | string | "" | Name of an existing NFS GCE VM. Auto-discovered when empty. |
| nfs_instance_base_name | string | "app-nfs" | Base name for an inline NFS GCE VM. |
| gcs_volumes | list(object) | [] | Additional GCS buckets to mount as GCS Fuse volumes. The ollama-models bucket is always appended automatically. |
| manage_storage_kms_iam | bool | false | Create a CMEK KMS key for GCS encryption. |
| enable_artifact_registry_cmek | bool | false | Enable CMEK encryption for Artifact Registry. |

GCS volume layout:

<resource_prefix>-models/          ← GCS bucket root (mounted at /mnt/gcs)
└── ollama/
    └── models/                    ← /mnt/gcs/ollama/models (OLLAMA_MODELS)

§5 · Backup & Maintenance (Group 6)

Ollama has no database — backup and import settings are present for interface compatibility with App CloudRun but have no operational effect.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| backup_schedule | string | "" | Not applicable for Ollama. Models are stored durably in GCS. |
| backup_retention_days | number | 7 | Days to retain backup files. |
| enable_backup_import | bool | false | Not applicable for Ollama. |
| backup_source | string | "gcs" | Source system: "gcs" or "gdrive". |
| backup_uri | string | "" | Location of the backup file. |
| backup_format | string | "sql" | Format: sql, tar, gz, tgz, tar.gz, zip, auto. |

§6 · CI/CD Integration (Group 7)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| enable_cicd_trigger | bool | false | Create a Cloud Build trigger on GitHub pushes. |
| github_repository_url | string | "" | Full HTTPS URL of the GitHub repository. |
| github_token | string | "" | GitHub Personal Access Token. Sensitive. |
| github_app_installation_id | string | "" | Cloud Build GitHub App installation ID. |
| cicd_trigger_config | object | { branch_pattern = "^main$" } | Branch filter, included/ignored paths, trigger name, and build substitutions. |
| enable_cloud_deploy | bool | false | Switch to a Cloud Deploy pipeline with promotion stages. |
| cloud_deploy_stages | list(object) | [dev, staging, prod(approval)] | Ordered promotion stages. |
| enable_binary_authorization | bool | false | Enforce Binary Authorization for signed container images. |
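
A minimal sketch of enabling the Cloud Build trigger, assuming a hypothetical repository and installation ID. Only branch_pattern is set inside cicd_trigger_config because it is the only field whose name this guide documents.

```hcl
# Sketch only: Cloud Build trigger on pushes to main.
enable_cicd_trigger        = true
github_repository_url      = "https://github.com/example-org/ollama-deploy" # hypothetical repository
github_app_installation_id = "12345678"                                     # hypothetical ID

cicd_trigger_config = {
  branch_pattern = "^main$" # same as the documented default, shown for clarity
}

# github_token is sensitive: prefer supplying it outside the tfvars file,
# for example through the TF_VAR_github_token environment variable.
```

Keeping the token out of version-controlled tfvars avoids leaking it through the repository that the trigger watches.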

Artifact Registry Image Lifecycle

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| max_images_to_retain | number | 7 | Maximum number of recent images to keep in Artifact Registry. |
| delete_untagged_images | bool | true | Delete untagged (dangling) images automatically. |
| image_retention_days | number | 30 | Days after which images are eligible for deletion. 0 disables age-based deletion. |
| max_revisions_to_retain | number | 7 | Maximum Cloud Run revisions to keep after each deployment. |
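
As an illustration, a tighter retention policy might look like the following. The numbers are arbitrary example values chosen to show the knobs, not tuned recommendations.

```hcl
# Sketch only: keep fewer images and revisions than the defaults.
max_images_to_retain    = 3    # default is 7
delete_untagged_images  = true # default
image_retention_days    = 14   # default is 30; 0 would disable age-based deletion
max_revisions_to_retain = 3    # default is 7
```

Retaining at least a few revisions keeps a rollback target available through traffic_split.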

§7 · Custom Initialization & Jobs (Group 8)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| enable_custom_sql_scripts | bool | false | Not applicable for Ollama. |
| custom_sql_scripts_bucket | string | "" | GCS bucket containing SQL scripts. |
| custom_sql_scripts_path | string | "" | Path prefix within the bucket. |
| custom_sql_scripts_use_root | bool | false | Execute scripts as root database user. |
| initialization_jobs | list(object) | [] | Cloud Run jobs executed once during deployment. When non-empty, overrides the automatic model-pull job entirely. See §3.B for the auto-generated job schema. |
| cron_jobs | list(object) | [] | Recurring Cloud Run jobs triggered by Cloud Scheduler. |

§8 · Database Backend (Group 11)

Ollama has no database dependency. Redis is also disabled for this module.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| database_password_length | number | 32 | Not used by Ollama. Present for App CloudRun interface compatibility. Valid range: 16–64. |

Hard-coded values (not user-configurable):

  • enable_redis = false
  • database_type = "NONE" (set via Ollama Common)

§9 · Observability & Health (Group 13)

Ollama's root endpoint (/) responds with "Ollama is running" once the server is ready. Probes target this path.

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| startup_probe | object | { enabled=true, type="HTTP", path="/", initial_delay_seconds=30, timeout_seconds=5, period_seconds=15, failure_threshold=20 } | Startup probe forwarded through Ollama Common. The 30 s initial delay plus 20 attempts at 15 s intervals allow up to 330 s (about 5.5 minutes) for model loading from GCS on first start. |
| liveness_probe | object | { enabled=true, type="HTTP", path="/", initial_delay_seconds=60, timeout_seconds=5, period_seconds=30, failure_threshold=3 } | Liveness probe. 60 s initial delay avoids false restarts during the model-load phase. |
| startup_probe_config | object | { enabled=true } | Structured startup probe passed directly to App CloudRun (240 s timeout by default). |
| health_check_config | object | { enabled=true } | Structured liveness probe passed directly to App CloudRun. |
| uptime_check_config | object | { enabled=true, path="/", check_interval="60s", timeout="10s" } | Cloud Monitoring uptime check from multiple global locations. |
| alert_policies | list(object) | [] | Cloud Monitoring alert policies notifying support_users. |
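
The startup budget follows from the probe fields: initial_delay_seconds + failure_threshold × period_seconds. As a hedged sketch, the override below keeps the same 330 s budget as the default but polls every 10 s so readiness is detected sooner. All attributes from the documented default are repeated, on the assumption that the object type requires every field.

```hcl
# Sketch only: same total budget as the default, finer polling interval.
# default: 30 + 20 x 15 = 330 s
# below:   30 + 30 x 10 = 330 s
startup_probe = {
  enabled               = true
  type                  = "HTTP"
  path                  = "/"   # Ollama answers "Ollama is running" here
  initial_delay_seconds = 30
  timeout_seconds       = 5
  period_seconds        = 10
  failure_threshold     = 30
}
```

When changing either period_seconds or failure_threshold, recompute the product so the budget still covers the 60–120 s model load.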

§10 · Outputs

| Output | Description |
| --- | --- |
| service_name | Name of the Cloud Run service. |
| service_url | Cloud Run service URL (HTTPS). |
| ollama_api_url | Ollama REST API base URL, constructed as <service_url>/api. Endpoints such as generate and chat live under it (<service_url>/api/generate, <service_url>/api/chat). |
| service_location | GCP region of the Cloud Run service. |
| models_bucket | GCS bucket name where Ollama model weights are persisted (<resource_prefix>-models). |
| storage_buckets | All provisioned GCS buckets. |
| network_name | VPC network name. |
| network_exists | Whether the VPC network exists. |
| regions | Available regions in the VPC. |
| container_image | Container image URI used by the service. |
| container_registry | Artifact Registry repository name. |
| deployment_id | Unique deployment identifier. |
| tenant_id | Tenant identifier. |
| resource_prefix | Resource naming prefix (app<name><tenant><id>). |
| project_id | GCP project ID. |
| project_number | GCP project number. |
| monitoring_enabled | Whether Cloud Monitoring is configured. |
| monitoring_notification_channels | Monitoring notification channel names. |
| uptime_check_names | Uptime check configuration names. |
| initialization_jobs | Created initialization job names. |
| nfs_server_ip | NFS server internal IP (sensitive). |
| nfs_mount_path | NFS mount path in containers. |
| nfs_share_path | NFS share path on server. |
| nfs_setup_job | NFS setup job name. |
| stage_services | Map of stage names to Cloud Run service details (for Cloud Deploy). |
| deployment_summary | Summary of the deployment configuration. |
| cicd_enabled | Whether CI/CD pipeline is enabled. |
| github_repository_url | GitHub repository URL connected for CI/CD. |
| github_repository_owner | GitHub repository owner/organization. |
| github_repository_name | GitHub repository name. |
| artifact_registry_repository | Artifact Registry repository. |
| cloudbuild_trigger_name | Cloud Build trigger name. |
| cloudbuild_trigger_id | Cloud Build trigger ID. |
| cicd_configuration | CI/CD pipeline configuration details. |

§11 · Platform-Managed Behaviours

The following behaviours are applied automatically and, unless a row states otherwise, cannot be overridden via tfvars.

| Behaviour | Detail |
| --- | --- |
| OLLAMA_MODELS injected | Set to /mnt/gcs/ollama/models, the GCS Fuse subdirectory inside the auto-provisioned models bucket. Do not set this in environment_variables. |
| OLLAMA_HOST injected | Set to "0.0.0.0:11434" so Cloud Run's ingress can forward traffic to the container. |
| OLLAMA_KEEP_ALIVE injected | Set to "24h" to keep the loaded model resident in memory between requests. Override by setting OLLAMA_KEEP_ALIVE in environment_variables. |
| Models bucket always provisioned | The <resource_prefix>-models GCS bucket is always created via Ollama Common.storage_buckets, regardless of create_cloud_storage or storage_buckets settings. |
| GCS volume always mounted | The ollama-models volume is always appended to gcs_volumes. Additional volumes specified in gcs_volumes are merged before the models volume. |
| execution_environment = "gen2" default | GCS Fuse requires the Cloud Run gen2 execution environment. The default enforces this. |
| No database, no Redis | enable_redis = false and database_type = "NONE" are hard-coded in main.tf. These cannot be changed. |
| Model-pull job auto-generated | When default_model is set and initialization_jobs = [], a Cloud Run Job (model-pull) is created automatically using the scripts/model-pull.sh script from Ollama Common. Providing any entry in initialization_jobs disables the auto-generated job entirely. |
| scripts_dir | Set to Ollama Common's bundled scripts/ directory. |

§12 · Variable Reference

Complete variable reference with UIMeta group assignments.

| Variable | Default | Group |
| --- | --- | --- |
| module_description | (Ollama CloudRun description) | 0 |
| module_documentation | "https://docs.radmodules.dev/docs/modules/Ollama_CloudRun" | 0 |
| module_dependency | ["Services GCP"] | 0 |
| module_services | (list of GCP services) | 0 |
| credit_cost | 50 | 0 |
| require_credit_purchases | false | 0 |
| enable_purge | true | 0 |
| public_access | true | 0 |
| deployment_id | "" | 0 |
| resource_creator_identity | "rad-module-creator@..." | 0 |
| project_id | (required) | 1 |
| tenant_deployment_id | "demo" | 1 |
| support_users | [] | 1 |
| resource_labels | {} | 1 |
| application_name | "ollama" | 2 |
| application_display_name | "Ollama LLM Server" | 2 |
| description | "Ollama — standalone open-source LLM inference server..." | 2 |
| application_version | "latest" | 2 |
| deploy_application | true | 3 |
| cpu_limit | "4000m" | 3 |
| memory_limit | "8Gi" | 3 |
| min_instance_count | 1 | 3 |
| max_instance_count | 1 | 3 |
| execution_environment | "gen2" | 3 |
| timeout_seconds | 3600 | 3 |
| container_protocol | "http1" | 3 |
| traffic_split | [] | 3 |
| enable_image_mirroring | true | 3 |
| service_annotations | {} | 3 |
| service_labels | {} | 3 |
| cloudsql_volume_mount_path | "/cloudsql" | 3 |
| ingress_settings | "internal" | 4 |
| vpc_egress_setting | "PRIVATE_RANGES_ONLY" | 4 |
| enable_iap | false | 4 |
| iap_authorized_users | [] | 4 |
| iap_authorized_groups | [] | 4 |
| enable_vpc_sc | false | 4 |
| vpc_cidr_ranges | [] | 4 |
| vpc_sc_dry_run | true | 4 |
| organization_id | "" | 4 |
| enable_audit_logging | false | 4 |
| admin_ip_ranges | [] | 4 |
| enable_cloud_armor | false | 4 |
| application_domains | [] | 4 |
| enable_cdn | false | 4 |
| environment_variables | {} | 5 |
| secret_environment_variables | {} | 5 |
| secret_propagation_delay | 30 | 5 |
| secret_rotation_period | "2592000s" | 5 |
| enable_auto_password_rotation | false | 5 |
| rotation_propagation_delay_sec | 90 | 5 |
| backup_schedule | "" | 6 |
| backup_retention_days | 7 | 6 |
| enable_backup_import | false | 6 |
| backup_source | "gcs" | 6 |
| backup_uri | "" | 6 |
| backup_format | "sql" | 6 |
| enable_cicd_trigger | false | 7 |
| github_repository_url | "" | 7 |
| github_token | "" | 7 |
| github_app_installation_id | "" | 7 |
| cicd_trigger_config | { branch_pattern = "^main$" } | 7 |
| enable_cloud_deploy | false | 7 |
| cloud_deploy_stages | [dev, staging, prod(approval)] | 7 |
| enable_binary_authorization | false | 7 |
| enable_custom_sql_scripts | false | 8 |
| custom_sql_scripts_bucket | "" | 8 |
| custom_sql_scripts_path | "" | 8 |
| custom_sql_scripts_use_root | false | 8 |
| initialization_jobs | [] | 8 |
| cron_jobs | [] | 8 |
| create_cloud_storage | true | 10 |
| storage_buckets | [] | 10 |
| enable_nfs | false | 10 |
| nfs_mount_path | "/mnt/nfs" | 10 |
| nfs_instance_name | "" | 10 |
| nfs_instance_base_name | "app-nfs" | 10 |
| gcs_volumes | [] | 10 |
| manage_storage_kms_iam | false | 10 |
| enable_artifact_registry_cmek | false | 10 |
| database_password_length | 32 | 11 |
| startup_probe_config | { enabled=true } | 13 |
| health_check_config | { enabled=true } | 13 |
| uptime_check_config | { enabled=true, path="/" } | 13 |
| alert_policies | [] | 13 |
| startup_probe | { path="/", initial_delay_seconds=30, failure_threshold=20 } | 13 |
| liveness_probe | { path="/", initial_delay_seconds=60, failure_threshold=3 } | 13 |
| max_images_to_retain | 7 | 13 |
| delete_untagged_images | true | 13 |
| image_retention_days | 30 | 13 |
| max_revisions_to_retain | 7 | 13 |
| default_model | "" | 18 |
| model_pull_timeout_seconds | 3600 | 18 |

§13 · Configuration Examples

Basic Deployment

CPU-only inference for 3B models. Suitable for development and shared internal API use.

# config/basic.tfvars
resource_creator_identity = ""
project_id = "my-gcp-project-id"
tenant_deployment_id = "demo"
application_name = "ollama"

cpu_limit = "4000m"
memory_limit = "8Gi"

min_instance_count = 1
max_instance_count = 1

ingress_settings = "internal"

default_model = "llama3.2:3b"

Advanced Deployment

Production inference endpoint for 7B models with monitoring, environment tuning, and mirroring.

# config/advanced.tfvars
resource_creator_identity = ""
project_id = "my-gcp-project-id"
tenant_deployment_id = "prod"
application_name = "ollama"
application_display_name = "Ollama LLM Server"
application_version = "0.3.12" # pinned tag; "latest" can pull a new Ollama version on each redeploy

cpu_limit = "8000m"
memory_limit = "16Gi"

min_instance_count = 1
max_instance_count = 5

timeout_seconds = 600
ingress_settings = "internal"

default_model = "mistral"
model_pull_timeout_seconds = 3600

environment_variables = {
  OLLAMA_NUM_PARALLEL = "2"
  OLLAMA_KEEP_ALIVE   = "24h"
}

support_users = ["ops@example.com"]
resource_labels = {
  env     = "production"
  team    = "ai-platform"
  service = "ollama"
}

enable_image_mirroring = true

Configuration Pitfalls & Sensible Defaults

Risk levels: Critical (data loss, full outage, security breach) — High (service unavailable or significant degradation) — Medium (degraded function or increased cost) — Low (minor impact).

| Variable | Sensible Default | Risk | Consequence of Incorrect Value |
| --- | --- | --- | --- |
| memory_limit | 8Gi (3B model) / 16Gi (7B model) | Critical | Insufficient memory causes OOM-kill mid-inference; container crashes in a restart loop. Cloud Run max is 32Gi; allocate at least 2× the model's quantised weight size. |
| cpu_limit | 4000m (3B) / 8000m (7B) | High | Insufficient CPU makes token generation extremely slow (minutes per token on 7B). Cloud Run CPU is throttled when not handling a request; set cpu_always_allocated = true (via annotations) when using min_instance_count ≥ 1. |
| min_instance_count | 1 | High | Setting to 0 enables scale-to-zero but causes 60–120 s cold starts while the model reloads from GCS. Inappropriate for interactive use cases or services where other modules call Ollama synchronously. |
| default_model | "" (empty, skip pull) | Medium | Leaving empty is safe for an initial infrastructure-only deploy, but the service returns 404 on inference until a model is pulled manually or via a subsequent deploy with this variable set. |
| model_pull_timeout_seconds | 3600 | High | Too short a timeout (e.g., 300) causes the init job to fail midway through pulling a large model (7B = ~4 GB, 13B+ = 8 GB+). The Cloud Run service then starts without the expected model. Default 3600 s is appropriate for most cases; increase to 7200 for 70B+ models. |
| application_version | "latest" | Medium | Using latest means a new Ollama version is pulled on each redeploy. Pin to a specific tag (e.g., "0.3.12") in production to prevent unintended API-breaking upgrades. |
| ingress_settings | "internal" | Critical | "all" exposes the Ollama REST API publicly: any caller can load, query, or delete models without authentication. Keep "internal" for VPC-only access or front with IAP. Ollama has no built-in auth. |
| vpc_egress_setting | "PRIVATE_RANGES_ONLY" | Medium | Required when Ollama needs to reach other VPC services (e.g., an NFS volume or another Cloud Run service). If set incorrectly, requests to VPC-internal addresses time out. |
| enable_iap | false | Critical | Without IAP or VPC restriction (ingress_settings = "internal"), the Ollama API is unauthenticated and publicly reachable. Always enable IAP or restrict ingress for any non-isolated deployment. |
| environment_variables.OLLAMA_ORIGINS | specific UI origin (not "*") | High | By default Ollama accepts cross-origin requests only from local origins such as 127.0.0.1 and 0.0.0.0. Setting OLLAMA_ORIGINS to "*" accepts any origin, enabling cross-site requests to the API from arbitrary browser contexts. Set it to the specific UI origin (e.g., "https://openwebui.example.com") when exposing through a load balancer. |
| environment_variables.OLLAMA_KEEP_ALIVE | "24h" (injected by this module; Ollama's own default is "5m") | Medium | With Ollama's own default, a model is unloaded from memory after 5 minutes of inactivity, causing a 30–60 s reload delay on the next request. Keep OLLAMA_KEEP_ALIVE = "24h" for warm-always deployments or set "-1" to never unload. |
| environment_variables.OLLAMA_NUM_PARALLEL | 1 (Ollama default) | Medium | Default of 1 serialises all inference requests. For production use with concurrent callers (OpenWebUI, N8N, Flowise), increase to 2–4 to allow parallel request handling, subject to available memory. |
| max_instance_count | 1 | High | Each Ollama instance independently loads the model into memory. Multiple instances are safe with GCS Fuse persistence but significantly increase cost. For large models, keep at 1 unless horizontal scaling is explicitly required. |
| storage_buckets | Auto-provisioned | High | The GCS bucket for model weights is auto-provisioned. If enable_purge = true and the module is destroyed, all downloaded model weights are deleted permanently. Set enable_purge = false to protect the models bucket. |
| gcs_volumes mount options | implicit-dirs | Medium | Omitting implicit-dirs from the GCS Fuse mount options causes directory listings to fail, breaking Ollama's model discovery. Always include this option. |
| startup_probe.failure_threshold | 20 | High | Ollama can take 60–120 s to mount GCS and initialise. A low failure_threshold (e.g., 5) combined with a short period_seconds causes Cloud Run to kill and restart the container before it is ready. |
| enable_cloud_armor | false | Medium | Cloud Armor is not enabled by default. Without it, DDoS and rate-limiting protection is absent. Consider enabling when ingress_settings = "all". |
| timeout_seconds | 3600 | High | LLM inference for large models can exceed 5 minutes per request. Long-running generations are terminated with a 504 if timeout_seconds is too low (e.g., 300). Keep 3600 for interactive multi-turn inference. |
| enable_image_mirroring | true | Medium | Setting to false pulls directly from Docker Hub at deploy time, which is subject to rate limits. Keep true in production to use Artifact Registry. |
| backup_schedule | "" (disabled) | Low | Backup settings have no operational effect for this module (see §5), so leaving this empty is expected. Model weights are stored in GCS and are inherently durable, but initialization-job metadata is not automatically backed up. |
| execution_environment | "gen2" | High | Gen1 does not support GCS Fuse or NFS mounts, or Direct VPC Egress. If enable_nfs = true, gen1 silently fails to mount the volume. Always use gen2. |
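
Several of the rows above can be combined into one hardened profile. The fragment below is a sketch under stated assumptions: the domain and UI origin are hypothetical, and pairing "internal-and-cloud-load-balancing" with Cloud Armor is a suggested way to keep callers from bypassing the load balancer, not behaviour confirmed by this guide.

```hcl
# Sketch only: externally reachable deployment fronted by Cloud Armor.
ingress_settings    = "internal-and-cloud-load-balancing" # accept traffic only from the VPC and the load balancer
enable_cloud_armor  = true
application_domains = ["ollama.example.com"]              # hypothetical domain

application_version = "0.3.12" # pinned tag instead of "latest"
enable_purge        = false    # keep model weights if the module is destroyed

memory_limit       = "16Gi"    # 7B model sizing from the table
cpu_limit          = "8000m"
min_instance_count = 1
max_instance_count = 1
timeout_seconds    = 3600

environment_variables = {
  OLLAMA_ORIGINS = "https://openwebui.example.com" # hypothetical UI origin; avoid "*"
}
```

Because Ollama has no built-in authentication, a profile like this still needs IAP or another authenticating layer in front of the API.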

Destroying Resources

Known Deletion Issue: Serverless IPv4 Address Release

When destroying a Cloud Run deployment, you may encounter an error similar to:

Error: Error waiting for Subnetwork to be deleted: The following serverless IPv4 address(es) on subnet ... are still in use.

Cause: After a Cloud Run service is deleted, GCP keeps its serverless IPv4 addresses reserved on the VPC subnet and releases them asynchronously, approximately 20–30 minutes later. Terraform/OpenTofu cannot complete the subnet or VPC deletion until they are fully released.

Resolution: Wait 20–30 minutes after the initial destroy attempt, then re-run the destroy command:

tofu destroy

The second run will succeed once GCP has released the reserved addresses.