Ollama on GKE Autopilot — Lab Guide

Overview

Estimated time: 1–2 hours

Ollama is a standalone open-source LLM inference server that runs large language models (Llama, Mistral, Gemma, Phi, and others) and exposes them through a REST API on port 11434. This lab deploys Ollama on GKE Autopilot and persists model weights to a GCS bucket mounted through the GCS Fuse CSI driver. No database is required. Other pods in the same cluster can call the Ollama API through its ClusterIP Service URL.

What the Module Automates

  • GKE Autopilot Deployment + ClusterIP Service + HPA
  • GCS bucket for model weight storage
  • GCS Fuse CSI volume mount for persistent model storage
  • Artifact Registry repository and image mirroring
  • Workload Identity for GCS bucket access
  • Secret Manager integration
  • Cloud Monitoring uptime checks and notification channels
  • Optional model-pull initialization job (when default_model is set)

What You Do Manually

  • Connect to the cluster and verify the Ollama pod
  • Port-forward to access the Ollama API locally
  • List available models
  • Pull and run a model
  • Use the chat completion API
  • Explore model management
  • Verify GCS model storage persistence
  • Explore Cloud Logging and Cloud Monitoring

CLI and REST API Overview

This lab uses three sets of tools:

Tool     Purpose
gcloud   Interact with GCP services (GCS, logs, metrics)
kubectl  Manage Kubernetes workloads and port-forward
curl     Call the Ollama REST API (port 11434)

Note: Ollama is deployed with a ClusterIP service by default, meaning it is accessible only within the GKE cluster. To call the API from your local machine, use kubectl port-forward. Other workloads in the same cluster (e.g., Flowise, N8N) can call http://ollama.<namespace>.svc.cluster.local:11434 directly.
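
As a small illustration of the in-cluster address pattern in the note above, the following sketch builds the URL from a Service name and a namespace. The helper name ollama_url and the example namespace are hypothetical; only the URL pattern and port 11434 come from this guide.

```shell
# Build the in-cluster Ollama URL from a Service name and namespace.
# ollama_url is a hypothetical helper, not something the module provides.
ollama_url() {
  local service="$1" namespace="$2" port="${3:-11434}"
  printf 'http://%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

# Example with a made-up namespace name:
ollama_url ollama appollama-demo
```

A workload in the same cluster could pass the resulting URL to curl or to an HTTP client as its Ollama base URL.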


Prerequisites

  • GCP project with billing enabled
  • Services GCP module deployed (provides VPC and GKE Autopilot cluster)
  • gcloud CLI authenticated (gcloud auth login; also run gcloud auth application-default login if you use tools that rely on Application Default Credentials)
  • kubectl configured or configurable via gcloud container clusters get-credentials
  • Access to the RAD UI with permission to deploy modules in the target GCP project
  • Sufficient CPU and memory quota: 3B models require ~8 GB RAM; 7B models require ~16 GB RAM

Phase 1 — Deploy [AUTOMATED]

Variables

In the RAD UI, open the Ollama GKE module and fill in the deployment form:

Variable                    Required  Default          Description
project_id                  Yes       (none)           GCP project ID
deployment_id               No        auto-generated   Short alphanumeric suffix for all resource names
region                      No        us-central1      GCP region
application_name            No        ollama           Base name for Kubernetes resources and GCS bucket
application_version         No        latest           Ollama Docker image tag
deploy_application          No        true             Set false to provision storage and IAM only
gke_cluster_name            No        auto-discover    Name of the GKE Autopilot cluster
default_model               No        ""               Model to pre-pull on first deployment (e.g., llama3.2:3b)
model_pull_timeout_seconds  No        3600             Timeout for model pull job (300–7200 s)
min_instance_count          No        1                Minimum pod replicas (set to 1 to keep a warm instance)
max_instance_count          No        3                Maximum pod replicas for HPA
container_resources         No        cpu=8, mem=16Gi  Pod CPU and memory limits
service_type                No        ClusterIP        Kubernetes Service type (use ClusterIP to keep API internal)
workload_type               No        Deployment       Use Deployment for GCS-backed storage
timeout_seconds             No        300              Pod termination grace period

Deploy

Click Deploy in the RAD UI.

Estimated Deployment Duration

Step                                   Estimated Time
Artifact Registry image mirror         3–5 minutes
GKE Autopilot pod scheduling           3–5 minutes
GCS Fuse volume mount                  1–2 minutes
Model pull job (if default_model set)  5–30 minutes (model size dependent)
Total                                  15–45 minutes

Key Outputs

After deployment completes, the following outputs are available in the RAD UI deployment panel:

Output              Description
ollama_cluster_url  Internal cluster URL: http://ollama.<namespace>.svc.cluster.local:11434
service_name        Kubernetes service name
namespace           Kubernetes namespace
service_cluster_ip  ClusterIP address
models_bucket       GCS bucket name where model weights are stored
storage_buckets     All created GCS bucket names
deployment_id       Unique deployment suffix

Set shell variables for use in later steps:

export PROJECT="your-gcp-project-id"   # set this first — your GCP project ID
export REGION="us-central1" # the region you deployed into
export TOKEN=$(gcloud auth print-access-token)

# Discover the GKE cluster
export CLUSTER=$(gcloud container clusters list \
--project=${PROJECT} \
--format="value(name)" \
--limit=1)

# Configure kubectl
gcloud container clusters get-credentials ${CLUSTER} \
--region=${REGION} \
--project=${PROJECT}

# Discover the namespace (pattern: appollama<tenant><deploymentid>)
export NAMESPACE=$(kubectl get namespaces --no-headers \
-o custom-columns=":metadata.name" | grep "^appollama" | head -1)
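
Because later phases depend on these variables, it can help to confirm that none of them came back empty (for example, when the cluster list or the namespace grep matched nothing). The sketch below is one way to do that; require_vars is a hypothetical helper name, not part of the module.

```shell
# Report every named variable that is unset or empty; return non-zero if any is.
# require_vars is a hypothetical helper for this lab.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

if require_vars PROJECT REGION CLUSTER NAMESPACE; then
  echo "all variables set"
else
  echo "set the missing variables before continuing"
fi
```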

Phase 2 — Connect to the Cluster [MANUAL]

Goal: Authenticate kubectl, verify the Ollama pod, and set up port-forwarding.

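  1. Get credentials for the GKE cluster (skip this step if you already ran it while setting the shell variables in Phase 1):

    gcloud container clusters get-credentials ${CLUSTER} \
    --region=${REGION} \
    --project=${PROJECT}

    Expected result: kubeconfig entry generated for <cluster-name>

  2. Find the Ollama namespace:

    kubectl get namespaces | grep ollama

    Expected result: A namespace whose name starts with appollama; it should match ${NAMESPACE}.

  3. Verify the pod is running: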

    kubectl get pods -n ${NAMESPACE}

    Expected result: A pod with name starting ollama- in Running status.

    Note: If default_model was set, wait for the model-pull initialization job to complete before proceeding. You can check the job status with:

    kubectl get jobs -n ${NAMESPACE}

  4. Port-forward the Ollama service to your local machine:

    kubectl port-forward svc/<service-name> 11434:11434 -n ${NAMESPACE}

    Leave this running in a separate terminal window.

    Expected result: Forwarding from 127.0.0.1:11434 -> 11434

  5. Verify Ollama is responding:

    curl http://localhost:11434

    Expected result: Ollama is running
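
The port-forward and the pod can take a few seconds to become ready, so a single curl may fail on the first try. The sketch below retries a command a fixed number of times; wait_for, the attempt count, and the delay are illustrative assumptions rather than anything the module defines.

```shell
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
# wait_for is a hypothetical helper; tune attempts and delay as needed.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Example (needs the port-forward from step 4 running in another terminal):
# wait_for 30 2 curl -sf http://localhost:11434 && echo "Ollama is up"
```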

gcloud equivalent (checks the status of the cluster itself, not of the workloads; expect RUNNING):

gcloud container clusters describe ${CLUSTER} \
--region=${REGION} \
--project=${PROJECT} \
--format="value(status)"

Phase 3 — List Available Models [MANUAL]

Goal: See which models are available in the Ollama instance.

  1. List all models currently installed:

    curl http://localhost:11434/api/tags

    Expected result: A JSON object with a models array listing installed models and their sizes. If default_model was set during deployment, it appears here.

  2. Format the output for readability:

    curl -s http://localhost:11434/api/tags | python3 -m json.tool

  3. Note that /api/tags lists only models already present in the model store (backed by GCS). Any model that is not listed must be pulled before it can be used (see Phase 4).

  4. Verify the GCS models bucket contains the model files:

    gcloud storage ls gs://<models_bucket>/

    Expected result: The top-level directories of the Ollama model store (e.g., blobs/, manifests/). The bucket may be empty if no model has been pulled yet.


Phase 4 — Pull and Run a Model [MANUAL]

Goal: Pull a small model and generate a response.

  1. Pull a small model (gemma2:2b is ~1.6 GB and runs well on CPU):

    curl -X POST http://localhost:11434/api/pull \
    -d '{"name": "gemma2:2b"}'

    Expected result: A streaming JSON response showing download progress with status fields (pulling manifest, pulling..., verifying sha256 digest, success).

    Expect the download to take 3–10 minutes for a model of up to about 3B parameters, depending on network speed. The model is written directly to the GCS Fuse mount and will persist across pod restarts.

  2. Once the pull is complete, run a prompt (non-streaming):

    curl http://localhost:11434/api/generate \
    -d '{
    "model": "gemma2:2b",
    "prompt": "Explain Kubernetes in one paragraph",
    "stream": false
    }'

    Expected result: A JSON response with a response field containing the generated text and metadata including eval_count and total_duration.

  3. Run a streaming prompt and observe the token-by-token output:

    curl http://localhost:11434/api/generate \
    -d '{
    "model": "gemma2:2b",
    "prompt": "What is the capital of France?",
    "stream": true
    }'

    Expected result: A stream of JSON objects, each with a response token, ending with "done": true.
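
To see how the streamed objects fit together, the sketch below reassembles the generated text from the response fields of a stream like the one in step 3. join_tokens is a hypothetical helper, and the sed pattern is a deliberate simplification: it assumes one compact JSON object per line and does not decode JSON escapes such as \" or \n, so a real JSON parser is the safer choice for anything beyond a quick check. The sample lines are made up for illustration.

```shell
# Concatenate the "response" values from a line-delimited JSON stream on stdin.
# Simplified: no JSON unescaping, one object per line assumed.
join_tokens() {
  sed -n 's/.*"response"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | tr -d '\n'
  echo
}

# Offline demo with made-up stream lines:
printf '%s\n' \
  '{"model":"gemma2:2b","response":"Par","done":false}' \
  '{"model":"gemma2:2b","response":"is","done":false}' \
  '{"model":"gemma2:2b","response":"","done":true}' | join_tokens

# Against the live API (requires the port-forward):
# curl -s http://localhost:11434/api/generate \
#   -d '{"model":"gemma2:2b","prompt":"What is the capital of France?","stream":true}' | join_tokens
```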


Phase 5 — Chat API [MANUAL]

Goal: Use the OpenAI-compatible chat completions endpoint.

  1. Send a chat message using the OpenAI-compatible API:

    curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gemma2:2b",
    "messages": [
    {"role": "user", "content": "What is GCP?"}
    ]
    }'

    Expected result: A JSON response in OpenAI format with choices[0].message.content containing the answer.

  2. Send a multi-turn conversation:

    curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gemma2:2b",
    "messages": [
    {"role": "user", "content": "My name is Alex."},
    {"role": "assistant", "content": "Hello Alex! How can I help you today?"},
    {"role": "user", "content": "What is my name?"}
    ]
    }'

    Expected result: The model recalls the name Alex.

  3. Explore streaming with the chat API:

    curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "gemma2:2b",
    "messages": [{"role": "user", "content": "List 3 GCP services"}],
    "stream": true
    }'

    Expected result: A stream of data:-prefixed JSON chunks in SSE format, ending with data: [DONE]. OpenAI SDK clients can consume this stream.
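
When the prompt comes from a shell variable rather than a hand-written literal, the request body has to be assembled with some care about quoting. The sketch below builds a single-turn chat payload; json_escape and chat_payload are hypothetical helpers, and the escaping only covers backslashes and double quotes (not newlines or other control characters).

```shell
# Escape backslashes and double quotes so text can sit inside a JSON string.
# Simplified: newlines and control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a single-turn chat completions payload: chat_payload <model> <user text>
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

chat_payload "gemma2:2b" 'What is GCP?'

# Sending it (requires the port-forward):
# curl -s http://localhost:11434/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload "gemma2:2b" 'What is GCP?')"
```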


Phase 6 — Model Management [MANUAL]

Goal: Inspect running models and explore model metadata.

  1. List running models (models currently loaded in memory):

    curl http://localhost:11434/api/ps

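    Expected result: A JSON object listing the loaded models, their sizes, and when each is scheduled to be unloaded (expires_at). Ollama keeps a model in memory for a configurable duration after its last use (OLLAMA_KEEP_ALIVE, 5 minutes by default), so the list is empty if no request was made recently.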

  2. View detailed model information:

    curl http://localhost:11434/api/show \
    -d '{"name": "gemma2:2b"}'

    Expected result: A JSON object with modelfile, parameters, template, details (family, parameter size, quantization level), and model_info.

  3. List all locally available models again to see the newly pulled model:

    curl http://localhost:11434/api/tags

  4. Copy a model to create a custom variant:

    curl -X POST http://localhost:11434/api/copy \
    -d '{"source": "gemma2:2b", "destination": "gemma2-custom"}'

  5. Delete a model when no longer needed:

    curl -X DELETE http://localhost:11434/api/delete \
    -d '{"name": "gemma2-custom"}'

Phase 7 — Verify GCS Model Storage [MANUAL]

Goal: Confirm models persist in GCS across pod restarts.

  1. List the contents of the Ollama models bucket:

    gcloud storage ls gs://<models_bucket>/

  2. Browse model manifests and blobs:

    gcloud storage ls gs://<models_bucket>/manifests/
    gcloud storage ls gs://<models_bucket>/blobs/

    Expected result: Directories and files corresponding to the pulled models. The blobs/ directory contains the model weight files.

  3. Understand the GCS Fuse mount:

    The Ollama container mounts the GCS bucket at /root/.ollama/models through the GCS Fuse CSI driver. Model files that Ollama writes go directly to GCS, so after a pod restart the models are available without re-downloading.

  4. Test persistence by restarting the pod:

    kubectl rollout restart deployment/ollama -n ${NAMESPACE}
    kubectl rollout status deployment/ollama -n ${NAMESPACE}

  5. Re-establish port-forwarding after the pod restarts:

    kubectl port-forward svc/<service-name> 11434:11434 -n ${NAMESPACE}

  6. Verify models are still available:

    curl http://localhost:11434/api/tags

    Expected result: The same models are listed as before the restart, loaded from GCS.
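
Rather than comparing the two /api/tags responses by eye, the model names can be extracted and compared. The sketch below is one way to do that with grep and sed; model_names is a hypothetical helper, and the pattern assumes that "name" appears only as the model-name key in the response, which is a simplification compared with a real JSON parser.

```shell
# Print the sorted model names found in an /api/tags response read on stdin.
# model_names is a hypothetical helper; it matches "name":"..." pairs textually.
model_names() {
  grep -o '"name"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/^"name"[[:space:]]*:[[:space:]]*"//; s/"$//' \
    | sort
}

# Offline demo with a made-up response:
printf '%s' '{"models":[{"name":"llama3.2:3b","size":1},{"name":"gemma2:2b","size":2}]}' | model_names

# Around the restart (requires the port-forward each time):
# before=$(curl -s http://localhost:11434/api/tags | model_names)
# ...restart the deployment and re-establish the port-forward...
# after=$(curl -s http://localhost:11434/api/tags | model_names)
# [ "$before" = "$after" ] && echo "models persisted"
```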


Phase 8 — Explore Cloud Logging [MANUAL]

Goal: View Ollama server logs and model loading events.

  1. Open the Cloud Console Logs Explorer:

    https://console.cloud.google.com/logs/query?project=<project-id>

  2. Query Ollama container logs:

    resource.type="k8s_container"
    resource.labels.namespace_name="<namespace>"
    resource.labels.container_name="ollama"

  3. Look for log entries showing:

    • Server startup: Listening on [::]:11434
    • Model loading: llm_load_print_meta output when a model is first loaded
    • Request handling: inference start/complete events

  4. Using gcloud CLI:

    gcloud logging read \
    "resource.type=\"k8s_container\" AND resource.labels.namespace_name=\"${NAMESPACE}\"" \
    --project=${PROJECT} \
    --limit=50 \
    --format="table(timestamp,jsonPayload.message)"

  5. Watch logs in real time while running a prompt:

    kubectl logs -f deployment/ollama -n ${NAMESPACE}

Expected result: Log entries showing model loading from GCS Fuse and inference request handling.
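
The Logs Explorer query in step 2 and the gcloud filter in step 4 use the same filter language, so the filter string can be generated once and reused. The sketch below builds it from a namespace and a container name; log_filter is a hypothetical helper, and the default container name ollama is taken from the query in step 2.

```shell
# Build a Cloud Logging filter for one container in one namespace.
# log_filter is a hypothetical helper; the container defaults to "ollama".
log_filter() {
  local namespace="$1" container="${2:-ollama}"
  printf 'resource.type="k8s_container" AND resource.labels.namespace_name="%s" AND resource.labels.container_name="%s"\n' \
    "$namespace" "$container"
}

# Example with a made-up namespace name:
log_filter appollama-demo

# With gcloud (uses the variables set in Phase 1):
# gcloud logging read "$(log_filter "${NAMESPACE}")" --project="${PROJECT}" --limit=50
```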


Phase 9 — Explore Cloud Monitoring [MANUAL]

Goal: Inspect pod resource utilization metrics.

  1. Open the Cloud Console Monitoring dashboard:

    https://console.cloud.google.com/monitoring?project=<project-id>

  2. Navigate to Metrics Explorer and query:

    • Metric: kubernetes.io/container/cpu/request_utilization
    • Filter by namespace_name = ${NAMESPACE}

  3. Query memory utilization (important for LLM inference):

    • Metric: kubernetes.io/container/memory/used_bytes
    • Filter by namespace_name = ${NAMESPACE}

  4. Check HPA (Horizontal Pod Autoscaler) status:

    kubectl get hpa -n ${NAMESPACE}
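    kubectl describe hpa -n ${NAMESPACE}

  5. List available GKE metric descriptors through the Cloud Monitoring API:

    curl -s -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode 'filter=metric.type = starts_with("kubernetes.io/container")' \
    --data-urlencode "pageSize=10" \
    "https://monitoring.googleapis.com/v3/projects/${PROJECT}/metricDescriptors"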

Expected result: CPU and memory graphs spiking during model inference, returning to baseline afterward.


Phase 10 — Undeploy [AUTOMATED]

When you have finished the lab, return to the RAD UI, navigate to your deployment, and click Undeploy (or Delete) to remove all resources provisioned by this module.

What is removed:

  • Kubernetes Deployment, Service, and namespace
  • GCS models bucket (if enable_purge = true) — note: this deletes all downloaded model weights
  • Artifact Registry mirrored image
  • Secret Manager secrets (if any)
  • Cloud Monitoring uptime checks and alert policies
  • Workload Identity bindings

Estimated time: 5–10 minutes

Resources provisioned by the Services GCP module (VPC, GKE cluster) are managed separately and must be undeployed via their own RAD UI deployment entry.


Summary

Phase                           Type       What You Learned
Phase 1: Deploy                 Automated  Module provisions GKE workload, GCS Fuse model storage, Workload Identity
Phase 2: Connect to Cluster     Manual     kubectl authentication, pod verification, and port-forwarding
Phase 3: List Available Models  Manual     Discovering pre-pulled and available models
Phase 4: Pull and Run a Model   Manual     Downloading a model and generating text via REST API
Phase 5: Chat API               Manual     OpenAI-compatible chat completions, multi-turn conversations, streaming
Phase 6: Model Management       Manual     Listing running models, viewing metadata, copying, and deleting models
Phase 7: GCS Model Storage      Manual     Verifying GCS persistence and testing pod restart durability
Phase 8: Cloud Logging          Manual     Viewing Ollama server logs and model load events
Phase 9: Cloud Monitoring       Manual     CPU/memory utilization during inference, HPA status
Phase 10: Undeploy              Automated  Clean teardown of all resources