PCDE Certification Preparation Guide: Section 4 — Deploy scalable and highly available databases in Google Cloud (~20% of the exam)
This guide covers Section 4 of the Professional Cloud Database Engineer (PCDE) exam — and it is the section where this repository is the answer key. "Automate database instance provisioning" is literally what Services_GCP does: every Cloud SQL, AlloyDB, Redis, and Firestore deployment in the platform is declarative infrastructure-as-code applied through your deployment portal or Cloud Build. The HA, replica, and monitoring machinery comes from Services_GCP; application failover behavior is observed through App_CloudRun/App_GKE. Deploy the ha-production profile from the PCDE Lab Map before starting.
4.1 Apply concepts to implement scalable and highly available databases in Google Cloud
⏱ ~90 min · 💰 moderate-to-high while ha-production is applied (REGIONAL ≈ 2× instance cost; each replica ≈ +1 instance) — revert to ZONAL/no-replica afterwards · ⚙️ Requires: ha-production profile
Why the exam cares — Section 1 asked you to choose an HA design; Section 4 asks you to implement and prove it: provision the HA topology, deploy and scale read replicas, replicate across regions, verify failover actually works (an untested DR plan is the exam's favorite anti-pattern), automate provisioning so environments are reproducible, and monitor the HA signals (replication lag, failover events, instance health) rather than just CPU.
How RAD implements it —
Provisioning HA, declaratively. The entire topology is variables on Services_GCP:
| Capability | Variable (default) | Resulting resource |
|---|---|---|
| HA primary | postgres_database_availability_type (ZONAL) → REGIONAL | A Cloud SQL primary instance with synchronous standby + automatic failover (same for mysql_database_availability_type) |
| Read replicas | create_postgres_read_replica (false), postgres_read_replica_count (1) | A Cloud SQL read replica instance (type READ_REPLICA_INSTANCE), always ZONAL |
| Cross-region placement | availability_regions (["us-central1"]) | With ≥2 regions, replicas land in the second region; otherwise they stay in the primary region |
| Read-pool scale-out | enable_alloydb_read_pool (false), alloydb_read_pool_node_count (1, 1–20) | An AlloyDB read-pool instance (type READ_POOL) |
| Cache HA | redis_tier (BASIC) → STANDARD_HA | Memorystore with automatic failover replica; the platform blocks BASIC at plan time when resource_labels.environment = "production" |
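As a minimal sketch, the table's variables could be collected in one variables file. The variable names come from the table; the file name ha-production.tfvars, the use of a local variables file (the platform normally takes these values through the deployment portal), and the specific values chosen are assumptions for illustration. Check the actual ha-production values against the PCDE Lab Map.

```shell
# Write a hypothetical variables file that moves the table's defaults to an
# HA, two-region topology. Variable names are from the table above; the file
# name and the values are illustrative assumptions, not the platform's profile.
cat > ha-production.tfvars <<'EOF'
availability_regions                = ["us-central1", "us-east1"]
postgres_database_availability_type = "REGIONAL"
create_postgres_read_replica        = true
postgres_read_replica_count         = 1
redis_tier                          = "STANDARD_HA"
EOF

echo "wrote ha-production.tfvars"
```

If you run OpenTofu locally rather than through the portal, a file like this would typically be passed with tofu plan -var-file=ha-production.tfvars.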
Automated provisioning. This is infrastructure-as-code end to end: tofu init → plan → apply (run by the platform's create/update pipeline in CI), idempotent re-application, dependency sequencing (instances are gated on the Service Networking connection; a 120 s delay separates the two Cloud SQL instances), and discovery-not-duplication in the app layer (App_Common finds the platform instance by its managed-by = services-gcp label; App_CloudRun provisions an equivalent inline ZONAL PostgreSQL 17 instance only when none exists). Replica lifecycle is also codified: replicas are rebuilt if the primary is replaced.
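Idempotent re-application can be checked mechanically. The sketch below assumes you can run tofu yourself in the module directory (the platform normally runs the plan for you) and relies on the -detailed-exitcode convention of the plan command: 0 means no changes, 2 means changes are proposed, and anything else is an error. The helper name check_plan is hypothetical.

```shell
# Classify a plan run using the -detailed-exitcode convention:
#   0 = no changes, 2 = changes proposed, anything else = error.
# check_plan is a hypothetical helper name, not part of the platform.
check_plan() {
  rc=0
  tofu plan -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "no-changes" ;;
    2) echo "changes-proposed" ;;
    *) echo "plan-error"; return 1 ;;
  esac
}
```

On an unmodified deployment the expected result is no-changes. After an out-of-band action such as a manual replica promotion, changes-proposed is the signal that the live estate has drifted from the declared state.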
Monitoring for HA databases. The platform ships CPU/memory/disk alert policies on resource.type = "cloudsql_database" wired to email channels (configure_email_notification, notification_alert_emails); each replica additionally publishes its endpoint as a <replica-name>-host secret so consumers fail over reads deliberately. Application-side, uptime_check_config (default { enabled = true, path = "/" }) creates a <service>-uptime-check synthetic probe plus failure alert whenever the application endpoint is publicly reachable — ready-made detection of user-visible impact during failover tests (internal-only deployments get none).
Try it
- Apply ha-production and map the fleet:
    gcloud sql instances list \
      --format="table(name, region, gceZone, settings.availabilityType, instanceType, state)"
  Expect the primary as REGIONAL in us-central1 and the replica as READ_REPLICA_INSTANCE in us-east1.
- Test HA. Note the current zone, force a failover, and time it:
    gcloud sql instances describe cloudsql-<prefix>-postgres --format="value(gceZone)"
    gcloud sql instances failover cloudsql-<prefix>-postgres
    gcloud sql operations list --instance=cloudsql-<prefix>-postgres --limit=3
  While it runs, hit the application URL (or watch the module-created <service>-uptime-check in Console > Monitoring > Uptime checks) to observe the brief connection blip. The Cloud SQL connector reconnects to the same connection name without any configuration change.
- Scale reads. Raise postgres_read_replica_count to 2 in the portal, apply, and confirm the new replica appears in the secondary region; then check replication health from the primary via psql:
    SELECT client_addr, state, replay_lag FROM pg_stat_replication;
- Test DR promotion (destructive to the replica's replica status, so do it on the second, disposable replica):
    gcloud sql instances promote-replica cloudsql-<prefix>-postgres-replica-1
    gcloud sql instances describe cloudsql-<prefix>-postgres-replica-1 \
      --format="value(instanceType, settings.availabilityType)"
  Note what Terraform now thinks: the promoted instance has drifted from the declared state, and the next tofu plan will want to reconcile it. Promotion is a break-glass action, not a managed workflow in these modules.
- Prove reproducibility, the automation claim of this section: re-run a plan over the unchanged deployment from your deployment portal (the platform runs tofu plan for you) and review the proposed changes.
You know it worked when the failover operation completes with the primary in a new zone, the promoted instance reports CLOUD_SQL_INSTANCE (no longer a replica), and a fresh tofu plan over the unmodified configuration shows no unexpected changes (idempotence), while the post-promotion plan visibly flags the drift.
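The "time it" part of the HA test can be roughly scripted. The sketch below polls the same describe command used in the exercise until the reported zone differs from the one you noted beforehand. The helper name wait_for_zone_change, the polling interval, and the poll limit are assumptions; the elapsed time it prints measures when the control plane reports the new zone, which is only a proxy for application-visible downtime.

```shell
# Poll until the instance reports a zone other than the one recorded before
# the failover, then print "<new-zone> <elapsed-seconds>".
# wait_for_zone_change is a hypothetical helper name for this sketch.
# Usage: wait_for_zone_change INSTANCE OLD_ZONE [INTERVAL_SECONDS] [MAX_POLLS]
wait_for_zone_change() {
  instance=$1
  old_zone=$2
  interval=${3:-5}
  max_polls=${4:-120}
  start=$(date +%s)
  polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    # describe can fail or return nothing mid-failover; treat that as "not yet".
    zone=$(gcloud sql instances describe "$instance" \
      --format="value(gceZone)" 2>/dev/null) || zone=""
    if [ -n "$zone" ] && [ "$zone" != "$old_zone" ]; then
      now=$(date +%s)
      echo "$zone $((now - start))"
      return 0
    fi
    polls=$((polls + 1))
    sleep "$interval"
  done
  echo "timed out waiting for a zone change on $instance" >&2
  return 1
}
```

Because gcloud sql instances failover waits for the operation by default, start the failover with --async (or from a second terminal) and then call the helper with the instance name and the zone you recorded, so that polling overlaps the operation.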
Check yourself
Q1: During a failover test of the REGIONAL instance, the application reconnected automatically without any configuration change. Why — and which connection pattern from this platform made that possible?
A: Cloud SQL HA failover keeps the instance's identity: the connection name and private IP move to the promoted standby. Because the app connects through the Cloud SQL connector volume (Cloud Run) or Auth Proxy sidecar (GKE) addressed by connection name, and reads credentials from Secret Manager, nothing client-side referenced the failed zone. Hardcoded zonal IPs are the anti-pattern this design avoids.
Q2: A scenario requires read traffic served in two regions and a documented region-loss runbook. Which variables build the topology, and which two steps remain manual?
A: availability_regions = ["us-central1", "us-east1"], postgres_database_availability_type = "REGIONAL", create_postgres_read_replica = true, postgres_read_replica_count ≥ 1 — replicas are placed in the secondary region and their endpoints published as secrets. Manual in a disaster: promoting the replica (gcloud sql instances promote-replica) and repointing applications to the promoted endpoint (e.g. updating the host secret). Cross-region failover is never automatic for Cloud SQL — a recurring exam point.
Q3: Why is Terraform-based provisioning itself an HA control, not just a convenience?
A: Reproducibility is recoverability: the entire database estate (instances, flags, networks, secrets, alerting) can be re-created in another project or region from code with tofu apply, and configuration drift is detected by tofu plan. Manual console-built instances cannot be rebuilt reliably under incident pressure. The exam frames this as "automate instance provisioning" — IaC plus idempotent re-application is the expected answer.
Beyond the modules: three gaps to study. (1) Managed cross-region promotion workflows: the modules build the replica but have no promotion/runbook automation; read "Promoting replicas" and "Cross-region replicas for disaster recovery" in the Cloud SQL docs, including how to re-establish replication after promotion. (2) Multi-region systems: Spanner multi-region instance configurations, which accept writes across regions, and AlloyDB secondary clusters (gcloud alloydb clusters create-secondary), which are read-only cross-region copies until promoted; learn the switchover/failover semantics of each. (3) Replication-lag alerting: the module alerts on CPU/memory/disk but not on cloudsql.googleapis.com/database/replication/replica_lag; practice adding that alert in Console > Monitoring > Alerting or via gcloud alpha monitoring policies create in a scratch project, since lag is the HA health signal for read-replica topologies.
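For the replication-lag gap, a policy file for gcloud alpha monitoring policies create might look like the sketch below. The metric and resource type are the ones named in this guide; the 60-second threshold, 5-minute duration, display names, file name, and the APPLY_POLICY switch are assumptions chosen for illustration, so tune them in a scratch project before relying on them.

```shell
# Write an illustrative alert policy for Cloud SQL replication lag.
# The metric and resource type are named in the text; the 60 s threshold,
# 5-minute duration, display names, and file name are assumptions.
cat > replica-lag-policy.json <<'EOF'
{
  "displayName": "Cloud SQL replica lag (scratch)",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "replica_lag above 60s for 5 minutes",
      "conditionThreshold": {
        "filter": "resource.type = \"cloudsql_database\" AND metric.type = \"cloudsql.googleapis.com/database/replication/replica_lag\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 60,
        "duration": "300s",
        "aggregations": [
          { "alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN" }
        ]
      }
    }
  ]
}
EOF

# Create the policy only when explicitly requested (APPLY_POLICY is a
# hypothetical switch for this sketch) and gcloud is installed.
if [ "${APPLY_POLICY:-no}" = "yes" ] && command -v gcloud >/dev/null 2>&1; then
  gcloud alpha monitoring policies create --policy-from-file=replica-lag-policy.json
fi
```

The policy has no notification channels attached, so as written it would only open incidents in Monitoring; wiring it to the email channels the platform already creates is left as part of the exercise.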
⚠️ Exam trap — gcloud sql instances failover works only on REGIONAL (HA) instances — running it against a ZONAL instance fails because there is no standby. And promotion is one-way: a promoted replica is a standalone primary; to get a replica back you create a new one and reseed. Distractors that "fail back by demoting" the promoted instance are wrong for Cloud SQL.
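The first half of the trap can be turned into a guard. This sketch reads settings.availabilityType, the same field used in the fleet listing earlier, and refuses to call failover unless the instance is REGIONAL. The helper name failover_if_ha is hypothetical.

```shell
# Refuse to call failover unless the instance is REGIONAL (HA).
# failover_if_ha is a hypothetical helper name for this sketch.
failover_if_ha() {
  instance=$1
  avail=$(gcloud sql instances describe "$instance" \
    --format="value(settings.availabilityType)") || avail=""
  if [ "$avail" != "REGIONAL" ]; then
    echo "skip: $instance is ${avail:-unknown}, so there is no standby to fail over to" >&2
    return 1
  fi
  gcloud sql instances failover "$instance"
}
```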