PDE Certification Preparation Guide: Section 5 — Optimizing performance and cost (~12% of the exam)
This guide helps candidates preparing for the Google Cloud Professional Cloud DevOps Engineer (PDE) certification explore Section 5 of the exam through the lens of the Tech Equity RAD platform at https://radmodules.dev. Three modules are relevant to this section: GCP Services, which establishes the foundational shared infrastructure; App CloudRun, which deploys serverless containerised applications on Cloud Run; and App GKE, which deploys containerised workloads on GKE Autopilot.
You interact with each module by configuring its variables in the RAD UI deployment portal, then exploring the resulting infrastructure in the GCP Console. This guide maps each exam topic to the relevant variables you can configure and the console locations where you can observe the outcomes. It also highlights PDE objectives that are not currently implemented by these modules, providing guidelines for self-guided research and exploration.
5.1 Collecting performance information in Google Cloud
Cloud Run Execution Environment and CPU Allocation
Concept: Selecting the right execution environment and CPU allocation model for each workload to eliminate performance bottlenecks and ensure consistent response latency.
In the RAD UI:
- Cloud Run Execution Environment: The `execution_environment` variable (App CloudRun module) controls whether Cloud Run uses the Gen2 execution environment (full Linux compatibility on a microVM rather than system-call emulation, with improved CPU and network performance) or Gen1 (the original gVisor-based sandbox, with faster cold starts and lower baseline cost for simple stateless workloads). Gen2 is recommended for most new workloads and is required for workloads that depend on full Linux compatibility, such as mounting network file systems.
- CPU Allocation (Always-On vs. Request-Only): The `cpu_always_allocated` variable (App CloudRun module, Group 3) controls whether the container's vCPU is allocated continuously while the instance is active. `cpu_always_allocated = false` (cost-optimized default) throttles the CPU to near zero between requests — suitable for stateless request-response workloads but incompatible with background threads or persistent connections. `cpu_always_allocated = true` keeps the CPU allocated at all times, enabling background processing such as cache warming, periodic aggregations, or PHP OPcache generation, but incurs continuous instance cost. Note that configurations above 1 vCPU (1000m) require `cpu_always_allocated = true`.
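To see why request-only allocation is the cost-optimized default, it helps to compare billable vCPU-seconds under the two modes. The sketch below uses hypothetical traffic figures and deliberately ignores memory charges, request fees, concurrency overlap, and the free tier:

```python
# Rough comparison of billable vCPU-seconds per hour for the two Cloud Run
# CPU-allocation modes. Traffic numbers are hypothetical; real billing also
# covers memory, request count, and regional pricing, and concurrent requests
# overlap on one instance, so this overstates request-only cost slightly.

def billable_vcpu_seconds(requests_per_hour, avg_request_seconds,
                          vcpus=1.0, always_allocated=False,
                          active_seconds_per_hour=3600):
    if always_allocated:
        # CPU is billed for the entire time the instance stays warm.
        return vcpus * active_seconds_per_hour
    # CPU is billed only while requests are being processed.
    return vcpus * requests_per_hour * avg_request_seconds

request_only = billable_vcpu_seconds(1200, 0.25)
always_on = billable_vcpu_seconds(1200, 0.25, always_allocated=True)
print(request_only, always_on)  # 300.0 vs 3600.0 vCPU-seconds per hour
```

For this low-traffic profile, request-only allocation bills roughly a twelfth of the always-on figure, which is why always-on is only worth paying for when background work genuinely needs the CPU between requests.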
Console Exploration:
Navigate to Cloud Run > [service] > Revisions and click the active revision. In the Configuration tab, observe the configured execution environment (Gen1 or Gen2) and the CPU allocation setting (CPU always allocated vs. CPU only allocated during request processing). Navigate to Cloud Run > [service] > Metrics and review the CPU utilization chart — with cpu_always_allocated = true, you will see sustained CPU utilization even between requests, reflecting background processing activity.
Real-world example: A team migrates their PHP-based API from App Engine to Cloud Run using Gen1 with cpu_always_allocated = false (the default). They observe intermittent p99 latency spikes on cache-cold requests — caused by OPcache regeneration competing with request serving when the CPU is unthrottled at the start of each request. Switching to execution_environment = "gen2" and cpu_always_allocated = true eliminates the throttling, allowing OPcache to warm continuously in the background. The p99 latency at cold start drops from 2.1 seconds to 340 milliseconds, with no change to the application code.
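The background-warming pattern in this example only works when the CPU stays allocated between requests. A minimal sketch of such a warmer thread (the cache contents and refresh logic are hypothetical stand-ins) looks like this; with `cpu_always_allocated = false`, Cloud Run would throttle this thread to near zero as soon as the last response is sent:

```python
# Minimal sketch of a background cache-warming thread. With
# cpu_always_allocated = true the thread keeps running between requests and
# the cache stays warm; with request-only CPU allocation it would stall.
import threading
import time

cache = {}

def warm_cache(stop_event, interval=0.05):
    while not stop_event.is_set():
        cache["ready"] = True          # stand-in for expensive regeneration
        cache["warmed_at"] = time.time()
        stop_event.wait(interval)      # pause between refresh cycles

stop = threading.Event()
worker = threading.Thread(target=warm_cache, args=(stop,), daemon=True)
worker.start()
time.sleep(0.2)   # let at least one warm cycle complete
stop.set()
worker.join()
print(cache.get("ready"))
```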
💡 Additional Performance Collection Objectives & Learning Guidelines
- Cloud Profiler — Continuous Production Profiling: Research Cloud Profiler for identifying CPU-intensive code paths and memory allocation hotspots in production workloads without impacting user traffic (sampling overhead is typically <1% CPU). Add the Cloud Profiler agent to your Cloud Run or GKE container image (agents are available for Go, Java, Node.js, and Python); profiles are then automatically uploaded to Cloud Profiler. Navigate to Profiler in the GCP Console to view flame graphs showing cumulative CPU time or heap allocation by function — enabling engineers to identify which function is responsible for, say, 60% of CPU time without reproducing the load profile locally.
- Cloud Trace — Latency Analysis and Bottleneck Identification: Research Cloud Trace for end-to-end distributed tracing across microservices. Cloud Run services automatically generate trace spans for incoming requests when the `X-Cloud-Trace-Context` header is present. For complete end-to-end traces, add the Cloud Trace SDK or OpenTelemetry SDK to propagate trace context through all downstream calls (Cloud SQL, Secret Manager, downstream Cloud Run services). Navigate to Trace > Trace list to filter traces by latency percentile — identify the slowest 1% of requests and inspect their waterfall diagrams to determine whether latency is concentrated in database queries, downstream API calls, or application processing.
- Cloud Run Metrics for Performance Analysis: Navigate to Monitoring > Metrics Explorer and explore Cloud Run-specific performance metrics: `run.googleapis.com/request_latencies` (request latency histogram by revision — use ALIGN_PERCENTILE_99 to surface tail latency), `run.googleapis.com/container/cpu/utilizations` (CPU utilization per revision — useful for identifying CPU saturation), `run.googleapis.com/container/memory/utilizations` (memory utilization — identify memory pressure before OOM kills), and `run.googleapis.com/container/startup_latencies` (instance startup time — compare Gen1 vs Gen2 startup duration). Build a Cloud Monitoring dashboard combining these metrics to create a performance baseline for each deployed revision.
- GKE Performance Diagnostics: For GKE Autopilot workloads, use the Kubernetes Engine > Workloads > [deployment] > Observability tab to view integrated CPU, memory, and network I/O metrics directly in the console without navigating to Cloud Monitoring. The Vertical Pod Autoscaler (VPA), enabled via `enable_vertical_pod_autoscaling = true` (App GKE module, Group 3), analyses actual resource usage over time and recommends adjusted CPU and memory requests — reducing both resource waste (over-requested but unused) and OOM-kill risk (under-requested and killed under load). Navigate to Kubernetes Engine > Workloads and check VPA recommendation objects using Cloud Shell: `kubectl get vpa -n [namespace]` shows the current recommended CPU and memory values alongside the currently configured requests.
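VPA's recommender is more sophisticated than this (it maintains a decaying histogram of usage samples), but the core idea of deriving a request from an upper percentile of observed usage plus a safety margin can be sketched simply. The 90th percentile and 15% margin below are illustrative assumptions, not VPA's actual parameters:

```python
# Illustrative sketch of percentile-based request sizing, the idea behind
# VPA recommendations. VPA itself uses a decaying histogram; the percentile
# and safety margin here are assumptions chosen for clarity.
import math

def recommend_request(usage_samples_millicores, percentile=0.90, margin=1.15):
    ranked = sorted(usage_samples_millicores)
    # Index of the chosen percentile (clamped to the last sample).
    idx = min(len(ranked) - 1, math.ceil(percentile * len(ranked)) - 1)
    return round(ranked[idx] * margin)

# Hypothetical per-minute CPU usage samples (millicores) for one container;
# note how the single 480m spike is ignored by the percentile.
samples = [120, 140, 135, 150, 480, 160, 155, 145, 130, 150]
print(recommend_request(samples))  # 184
```

A percentile-based request absorbs normal variation without paying for the rare spike, which is exactly the over- versus under-provisioning trade-off VPA automates.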
5.2 Implementing FinOps practices for optimizing resource utilization and costs
Scale-to-Zero and Container Resource Rightsizing
Concept: Eliminating compute costs for idle workloads and precisely sizing resource allocations to maximize workload density and minimize waste.
In the RAD UI:
- Scale-to-Zero (Cloud Run): Setting `min_instance_count = 0` (App CloudRun module) enables Cloud Run's scale-to-zero behaviour — when no traffic is being served, all instances terminate and no compute charges accrue. For workloads that can tolerate cold-start latency (typically 1–3 seconds for a pre-built container), scale-to-zero can reduce Cloud Run costs by 60–90% for workloads with uneven or low traffic patterns. Setting `min_instance_count = 1` prevents cold starts but incurs a continuous minimum instance cost.
- Container Resource Requests and Limits (GKE): The `container_resources` variable (App GKE module, Group 3) sets the CPU and memory requests and limits for each container. Requests determine the resources reserved on a GKE node for scheduling — setting requests too high wastes node capacity and inflates cluster cost; setting them too low causes CPU throttling and OOM kills. GKE Autopilot bills for the requested resources (not the node size), so precise resource requests directly control billing.
- Vertical Pod Autoscaler for Automatic Right-Sizing (GKE): Setting `enable_vertical_pod_autoscaling = true` (App GKE module, Group 3) enables the Vertical Pod Autoscaler, which analyses actual CPU and memory consumption over time and automatically adjusts the pod's resource requests to match observed usage. This reduces both over-provisioning (paying for unused resources) and under-provisioning (OOM kills under load) without manual tuning. VPA is particularly valuable in the first weeks after a new deployment, when initial resource estimates are often inaccurate. Note that VPA and HPA should not both act on CPU simultaneously — use VPA when HPA scales on custom metrics (e.g. request throughput) rather than CPU.
- Namespace ResourceQuota as a Cost Ceiling (GKE): The `enable_resource_quota` variable (App GKE module, Group 15) creates a Kubernetes `ResourceQuota` object in the application namespace, setting hard upper bounds on aggregate CPU requests, memory requests, and pod count across all workloads in that namespace. In a multi-tenant cluster where multiple application teams share nodes, ResourceQuota prevents any single namespace from monopolising cluster capacity — which directly prevents runaway scaling from driving unexpected cost spikes. Key quota variables: `quota_cpu_limits` caps total CPU bursting, `quota_memory_limits` caps total memory, and `quota_max_pods` limits how many pods (including job pods) can exist simultaneously. Size quotas based on your expected peak pod count × per-pod requests — a quota set too low will block new pods from scheduling during deployments or cron job execution.
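The sizing rule above (expected peak pod count × per-pod requests, plus room for surge pods) can be turned into a small calculation. The 25% headroom factor below is an illustrative assumption, not a Google-recommended value:

```python
# Sizing a namespace ResourceQuota from expected peak pods × per-pod
# requests. The 25% headroom for rolling-update surge pods and cron job
# pods is an illustrative assumption; tune it to your deployment strategy.
def size_quota(peak_pods, cpu_request_m, mem_request_mi, headroom=1.25):
    pods = int(peak_pods * headroom)
    return {
        "quota_max_pods": pods,
        "quota_cpu_limits_m": pods * cpu_request_m,        # millicores
        "quota_memory_limits_mi": pods * mem_request_mi,   # MiB
    }

# Hypothetical workload: 12 pods at peak, each requesting 250m CPU / 512Mi.
print(size_quota(peak_pods=12, cpu_request_m=250, mem_request_mi=512))
```

Without the headroom term, a rolling deployment that briefly runs old and new pods side by side would hit the pod quota and stall the rollout.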
Console Exploration:
Navigate to Cloud Run > [service] > Revisions and inspect the minimum and maximum instance count settings. Navigate to Cloud Run > [service] > Metrics and review the Instance count chart over time — observe the scale-to-zero behaviour (instance count dropping to 0 during quiet periods) and scale-up events (instance count rising as requests arrive). For GKE, navigate to Kubernetes Engine > Workloads > [deployment], select a pod, and review the YAML tab to confirm the resources.requests and resources.limits values. Navigate to Kubernetes Engine > Clusters > [cluster] > Observability to see cluster-wide CPU and memory utilization versus requested capacity.
Real-world example: A startup runs a data transformation Cloud Run service that processes files uploaded by customers during business hours only. With min_instance_count = 0 and max_instance_count = 10, the service runs zero instances from 22:00 to 08:00 and on weekends — eliminating compute costs for roughly 60% of the calendar week. The team further tunes concurrency = 80 (processing up to 80 concurrent file transformation requests per instance), reducing the peak instance count from 10 to 3 for typical load. Combined, these settings reduce their Cloud Run bill by 83% compared to their previous always-on App Engine deployment.
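A back-of-the-envelope model shows where savings like these come from. The instance-hour figures below are hypothetical and Cloud Run actually bills per unit of vCPU and memory time, so treat this as the shape of the saving rather than a price quote:

```python
# Rough weekly instance-hours under min_instance_count = 1 vs = 0.
# All traffic figures are hypothetical; real Cloud Run billing is per unit
# of vCPU-time and memory-time, not per whole instance-hour.
HOURS_PER_WEEK = 168

def weekly_instance_hours(avg_busy_instances, busy_hours, min_instances):
    idle_hours = HOURS_PER_WEEK - busy_hours
    # Outside busy hours the fleet sits at the configured floor.
    return avg_busy_instances * busy_hours + min_instances * idle_hours

always_warm = weekly_instance_hours(3, 50, min_instances=1)
scale_to_zero = weekly_instance_hours(3, 50, min_instances=0)
saving = 1 - scale_to_zero / always_warm
print(always_warm, scale_to_zero, round(saving, 2))
```

Even before concurrency tuning, dropping the idle floor alone removes all of the off-hours instance time; the example's 83% total saving comes from stacking this with the higher concurrency and the move off an always-on platform.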
💡 Additional FinOps Objectives & Learning Guidelines
- Cloud Billing Export to BigQuery: Research how to export Cloud Billing data to BigQuery for granular cost analysis. Navigate to Billing > Billing export > BigQuery export to configure detailed usage cost export (SKU-level granularity, including per-resource labels) and pricing export (list and contract prices). Once exported, run SQL queries in BigQuery to analyze cost by project, service, label, or SKU — for example, identifying which Cloud Run service consumed the most CPU-seconds last month, or which GKE namespace generated the most egress cost. This is the foundation of any FinOps practice on Google Cloud.
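As a starting point for that analysis, a query like the following sketch works against the standard billing export schema (fields such as `service.description`, `project.id`, `cost`, and the `credits` array exist in that schema; the dataset and table name are placeholders you must replace with your own export table):

```python
# Example SQL for the standard Cloud Billing BigQuery export: net cost per
# service and project over the last 30 days. The table name is a placeholder;
# your export table is named for your own billing account ID.
QUERY = """
SELECT
  service.description AS service,
  project.id AS project,
  SUM(cost)
    + SUM(IFNULL((SELECT SUM(c.amount) FROM UNNEST(credits) c), 0)) AS net_cost
FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`
WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY service, project
ORDER BY net_cost DESC
"""
print(QUERY)
```

Note that credits are stored as negative amounts in the export, so adding them to `cost` yields the net figure the Cost breakdown page shows.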
- Cost Breakdown and Billing Reports: Navigate to Billing > Reports to explore the built-in billing reports dashboard. Use the Group by and Filter controls to break down costs by service, project, region, SKU, and resource label. Navigate to Billing > Cost breakdown to understand the net cost after committed use discount credits and sustained use discounts are applied. Understanding the difference between list price, credits, and net cost is essential for accurate FinOps reporting and showback/chargeback to internal teams.
- Recommender — Rightsizing and Idle Resource Detection: Research the Recommender service, which uses machine learning to analyze 30 days of usage data and surface actionable cost-saving recommendations. Navigate to Recommender in the console (or Active Assist in some console views). Key recommendation types to know: VM machine type rightsizing (suggests downsizing over-provisioned Compute Engine instances), idle VM recommendations (identifies VMs with <5% average CPU for 14 days), idle GKE cluster recommendations, and overprovisioned Cloud Run CPU recommendations. Recommendations include an estimated monthly saving and a risk level — apply low-risk recommendations immediately and investigate high-risk recommendations before acting.
- Committed Use Discounts (CUDs) for GKE: Research how Committed Use Discounts trade a one- or three-year commitment for substantial savings on GKE Autopilot vCPU and memory charges (resource-based Compute Engine CUDs reach roughly 37% for one-year and 55% for three-year terms; Autopilot commitment rates differ, so verify current pricing). Unlike Compute Engine CUDs (which commit to specific machine types), GKE Autopilot CUDs commit to a resource amount (vCPU-hours and GB-hours) that can be consumed by any pod across any namespace and workload. Navigate to Billing > Commitments to explore available commitment options and their discount rates. Research the GKE usage metering feature, which attributes cluster costs to namespaces and labels — enabling commitment sizing based on actual per-team or per-application consumption.
- Budget Alerts: Research how to create Cloud Billing budgets to alert when actual or forecasted spend approaches a threshold. Navigate to Billing > Budgets & alerts > Create budget. Configure a budget scoped to a specific project or service (e.g., Cloud Run only), set threshold rules (e.g., alert at 50%, 90%, and 100% of monthly budget), and connect a Pub/Sub topic to trigger automated responses (e.g., a Cloud Function that scales down non-production services when 90% of the budget is consumed). Budget alerts are reactive controls — pair them with Recommender's proactive recommendations for complete FinOps coverage.
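The decision logic inside such a Pub/Sub-triggered function is small. The sketch below parses the budget notification payload (the `costAmount` and `budgetAmount` fields appear in Cloud Billing budget notifications; the scale-down action itself is left as a stub you would implement against the Cloud Run Admin API):

```python
# Sketch of the decision logic for a budget-alert Pub/Sub subscriber.
# costAmount and budgetAmount are fields of the Cloud Billing budget
# notification JSON; the trigger ratio of 0.9 mirrors the 90% threshold
# described above.
import json

def should_scale_down(pubsub_data: bytes, trigger_ratio: float = 0.9) -> bool:
    msg = json.loads(pubsub_data)
    spend_ratio = msg["costAmount"] / msg["budgetAmount"]
    return spend_ratio >= trigger_ratio

# Simulated notification: 94% of a 1000-unit monthly budget consumed.
sample = json.dumps({"budgetDisplayName": "run-prod",
                     "costAmount": 940.0,
                     "budgetAmount": 1000.0}).encode()
print(should_scale_down(sample))  # True
```

In production you would gate the action on the message's budget name as well, so an alert for one team's budget cannot scale down another team's services.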
- Spot VMs for Batch GKE Workloads: Research Spot VMs for GKE node pools running fault-tolerant batch workloads (data processing, ML training, CI/CD build agents). Spot VMs are priced at a 60–91% discount versus standard VMs but can be preempted by Google with a 30-second shutdown notification when capacity is needed elsewhere. Unlike the older Preemptible VMs (which had a 24-hour maximum runtime), Spot VMs have no maximum lifespan — they run until preempted. In GKE, configure a Spot node pool with `spot: true` in the node pool spec and use Kubernetes tolerations and a `nodeSelector` to schedule appropriate workloads onto Spot nodes while keeping latency-sensitive services on standard on-demand nodes.
- Cloud Run Cost Optimization — Concurrency Tuning: Research the relationship between Cloud Run concurrency and cost. Higher `concurrency` settings (requests handled simultaneously per instance) reduce instance count for the same throughput — directly reducing compute costs. However, very high concurrency on CPU-bound workloads degrades latency as requests compete for the same vCPU. Research the Cloud Run concurrency guidance: I/O-bound workloads (waiting on database or API calls) benefit from high concurrency (80–1000); CPU-bound workloads (image processing, ML inference) benefit from lower concurrency (1–10) with more instances. Navigate to Cloud Run > [service] > Edit > Container, Networking, Security > Capacity to adjust concurrency and model the cost impact using the pricing calculator.
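The instance-count effect of concurrency follows from Little's law (in-flight requests equal arrival rate times latency). The traffic figures in this sketch are hypothetical:

```python
# Estimating steady-state Cloud Run instance count from concurrency via
# Little's law: in-flight requests = request rate x average latency.
# Traffic figures are hypothetical; real autoscaling also reacts to CPU.
import math

def instances_needed(rps, avg_latency_s, concurrency):
    in_flight = rps * avg_latency_s
    return max(1, math.ceil(in_flight / concurrency))

# I/O-bound service at 200 req/s with 500 ms latency: 100 in-flight requests.
print(instances_needed(rps=200, avg_latency_s=0.5, concurrency=80))  # 2
print(instances_needed(rps=200, avg_latency_s=0.5, concurrency=2))   # 50
```

The same traffic needs 25 times the fleet at concurrency 2, which is why concurrency is usually the first knob to examine when modelling Cloud Run cost, before touching CPU or memory sizes.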