Crawl4AI on Google Kubernetes Engine (GKE Autopilot)

This document provides a comprehensive reference for the modules/Crawl4AI_GKE Terraform module. It covers architecture, IAM, configuration variables, Crawl4AI-specific behaviours, and operational patterns for deploying Crawl4AI on GKE Autopilot.


1. Module Overview

Crawl4AI is an open-source LLM-friendly web crawler and scraper with 40,000+ GitHub stars. Crawl4AI GKE is a wrapper module built on top of App GKE. It uses App GKE for all GCP infrastructure provisioning and injects Crawl4AI-specific application configuration via Crawl4AI Common.

Key Capabilities:

  • Compute: GKE Autopilot, Python container, 4 vCPU / 8 Gi default. Supervisord manages embedded Redis (task queue) and Gunicorn ASGI server inside the pod. A dedicated /dev/shm emptyDir volume is mounted for Chromium shared memory — GKE provides proper /dev/shm support unlike Cloud Run's tmpfs approach.
  • Data Persistence: Stateless — no external database is provisioned. Redis runs inside the pod. Horizontal Pod Autoscaler (HPA) manages scaling.
  • Security: Inherits Workload Identity, Cloud Armor WAF, IAP, Binary Authorization, and VPC Service Controls from App GKE. No application secrets are auto-generated.
  • AI Integration: Supports LLM-based extraction via OpenAI, Anthropic, DeepSeek, Groq, Gemini, and custom providers. API keys are injected via secret_environment_variables.

Architecture note: On GKE, Crawl4AI benefits from proper /dev/shm support via an emptyDir volume, unlike the Cloud Run Gen2 workaround that redirects Chromium to /tmp. This makes GKE the preferred platform for high-concurrency crawling workloads.
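For readers who want to see what a memory-backed /dev/shm mount looks like, the following is a minimal sketch using the Terraform Kubernetes provider. It is not the App GKE implementation, which this document does not show. The resource name, labels, volume name, and the 2Gi size limit are illustrative assumptions; the image and port 11235 are taken from elsewhere in this document.

```hcl
# Illustrative only: App GKE provisions the real Deployment.
# Names, labels, and the size_limit value are assumptions.
resource "kubernetes_deployment_v1" "crawl4ai_shm_sketch" {
  metadata {
    name = "crawl4ai"
  }

  spec {
    replicas = 1

    selector {
      match_labels = { app = "crawl4ai" }
    }

    template {
      metadata {
        labels = { app = "crawl4ai" }
      }

      spec {
        container {
          name  = "crawl4ai"
          image = "unclecode/crawl4ai:latest"

          port {
            container_port = 11235
          }

          # Chromium sees the memory-backed volume at /dev/shm.
          volume_mount {
            name       = "dshm"
            mount_path = "/dev/shm"
          }
        }

        # emptyDir with medium "Memory" is a tmpfs-backed volume.
        volume {
          name = "dshm"
          empty_dir {
            medium     = "Memory"
            size_limit = "2Gi" # assumed size; tune to crawl concurrency
          }
        }
      }
    }
  }
}
```

Data written to a memory-backed emptyDir counts toward the container's memory usage, so the size limit should be considered together with the container memory limit.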


2. IAM & Access Control

Crawl4AI GKE delegates all IAM provisioning to App GKE. Workload Identity binds the Kubernetes service account to a GCP service account for Secret Manager access.

No auto-generated secrets: Crawl4AI Common creates no Secret Manager secrets. Inject SECRET_KEY (for JWT authentication) and LLM API keys via secret_environment_variables.
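Secret injection might look like the sketch below. The module source path, the secret names, and the value shape of secret_environment_variables (environment variable name mapped to an existing Secret Manager secret ID) are assumptions; check the module's variable definition for the exact type. Other required module inputs are omitted.

```hcl
module "crawl4ai" {
  source = "./modules/Crawl4AI_GKE" # assumed relative path

  # Assumed shape: env var name => existing Secret Manager secret ID.
  # Both secret names are hypothetical and must already exist.
  secret_environment_variables = {
    SECRET_KEY     = "crawl4ai-jwt-secret"
    OPENAI_API_KEY = "crawl4ai-openai-key"
  }
}
```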


3. Core Service Configuration

A. Compute (GKE)

Variable | Group | Default | Description
deploy_application | 4 | true | Set false for infrastructure-only deployment.
workload_type | 4 | null | 'Deployment' (stateless, default) or 'StatefulSet'.
container_resources | 4 | { cpu_limit = "4", memory_limit = "8Gi", cpu_request = "2", mem_request = "4Gi" } | Container CPU and memory limits and requests. Minimum 4 Gi memory.
min_instance_count | 4 | 1 | Minimum pod replicas. Set to 1 for a warm Chromium pool.
max_instance_count | 4 | 5 | Maximum pod replicas for HPA. Range: 1–1000.
timeout_seconds | 4 | 1800 | Pod termination grace period. Set to at least 1800 s to allow long batch crawls to drain.
termination_grace_period_seconds | 4 | 60 | Seconds Kubernetes waits for the pod to terminate gracefully.
service_type | 4 | 'LoadBalancer' | Kubernetes Service type: 'ClusterIP', 'LoadBalancer', or 'NodePort'.
session_affinity | 4 | 'None' | 'None' distributes requests across all pods.
enable_image_mirroring | 4 | true | Mirror Crawl4AI image to Artifact Registry.
enable_vertical_pod_autoscaling | 4 | false | Enable VPA for automatic resource adjustment.
container_image_source | 4 | 'prebuilt' | 'prebuilt' uses unclecode/crawl4ai directly; 'custom' builds via Cloud Build.
container_image | 4 | 'unclecode/crawl4ai' | Full URI of the container image when container_image_source = 'prebuilt'.
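As a sketch of how the compute variables above combine, the fragment below restates the documented defaults explicitly. The module source path is an assumption and other required inputs are omitted; the object keys of container_resources are copied from the default shown in the table.

```hcl
module "crawl4ai" {
  source = "./modules/Crawl4AI_GKE" # assumed relative path

  # Documented defaults, written out so they are easy to adjust.
  container_resources = {
    cpu_limit    = "4"
    memory_limit = "8Gi"
    cpu_request  = "2"
    mem_request  = "4Gi"
  }

  min_instance_count = 1 # keeps one warm Chromium pool
  max_instance_count = 5 # HPA ceiling

  service_type     = "LoadBalancer"
  session_affinity = "None"
}
```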

B. Crawl4AI-Specific Configuration

Variable | Group | Default | Description
redis_task_ttl_seconds | 19 | 3600 | TTL in seconds for task results in embedded Redis. Range: 300–86400.

C. Application Identity

Variable | Group | Default | Description
application_name | 3 | 'crawl4ai' | Internal identifier for the application.
application_display_name | 3 | 'Crawl4AI Web Crawler' | Human-readable name shown in the platform UI.
application_description | 3 | (Crawl4AI GKE description) | Brief description of the application's purpose.
application_version | 3 | 'latest' | Crawl4AI Docker image tag. Use a pinned version for production.

D. Networking

Variable | Group | Default | Description
enable_iap | 20 | false | Enable IAP via Kubernetes Gateway. Requires enable_custom_domain = true.
iap_authorized_users | 20 | [] | User emails authorized via IAP.
iap_authorized_groups | 20 | [] | Google Groups authorized via IAP.
iap_oauth_client_id | 20 | "" | OAuth client ID for IAP. Sensitive.
iap_oauth_client_secret | 20 | "" | OAuth client secret for IAP. Sensitive.
enable_custom_domain | 19 | false | Enable custom domain via Kubernetes Gateway API with SSL certificates.
application_domains | 19 | [] | Custom domains for the application.
reserve_static_ip | 19 | true | Reserve a static external IP.
static_ip_name | 19 | "" | Name for the reserved static IP. Auto-generated when empty.
network_tags | 19 | ['nfsserver'] | Network tags applied to GKE nodes.
gke_cluster_name | 6 | "" | GKE cluster name. Leave empty to auto-discover.
namespace_name | 6 | "" | Kubernetes namespace. Auto-generated when empty.
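Because enable_iap depends on enable_custom_domain, the two are usually set together. The following is a minimal sketch under stated assumptions: the module source path, the domain, and the email address are hypothetical, and the exact format expected for authorized users should be checked against the module. The OAuth credentials are passed through sensitive root variables so they stay out of the configuration files.

```hcl
variable "iap_oauth_client_id" {
  type      = string
  sensitive = true
}

variable "iap_oauth_client_secret" {
  type      = string
  sensitive = true
}

module "crawl4ai" {
  source = "./modules/Crawl4AI_GKE" # assumed relative path

  # IAP is served through the Kubernetes Gateway, so a custom domain is required.
  enable_custom_domain = true
  application_domains  = ["crawl.example.com"] # hypothetical domain

  enable_iap              = true
  iap_authorized_users    = ["alice@example.com"] # hypothetical user
  iap_oauth_client_id     = var.iap_oauth_client_id
  iap_oauth_client_secret = var.iap_oauth_client_secret
}
```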

E. Environment Variables & LLM Integration

Variable | Group | Default | Description
environment_variables | 5 | {} | Additional environment variables. PYTHONUNBUFFERED and REDIS_TASK_TTL are set automatically. Do NOT set REDIS_HOST or REDIS_PORT.
secret_environment_variables | 5 | {} | Secret Manager secret references. Use for SECRET_KEY, OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.
secret_propagation_delay | 5 | 30 | Seconds to wait after secret creation.
secret_rotation_period | 5 | '2592000s' | Secret Manager rotation period. Default: 30 days.
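The restriction on REDIS_HOST and REDIS_PORT can be enforced on the caller's side before values reach the module. The sketch below is a hypothetical root-level variable, not the module's own definition, and it assumes environment_variables is a map of strings. It rejects the two reserved keys using Terraform's built-in validation block.

```hcl
# Hypothetical caller-side guard; the module's own variable may differ.
variable "crawl4ai_environment_variables" {
  type    = map(string)
  default = {}

  validation {
    # Fails the plan if either reserved key is present.
    condition = length(setintersection(
      keys(var.crawl4ai_environment_variables),
      ["REDIS_HOST", "REDIS_PORT"]
    )) == 0
    error_message = "Do not set REDIS_HOST or REDIS_PORT; Redis runs inside the pod."
  }
}

module "crawl4ai" {
  source = "./modules/Crawl4AI_GKE" # assumed relative path

  environment_variables = var.crawl4ai_environment_variables
}
```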

4. Advanced Security

Variable | Group | Default | Description
enable_binary_authorization | 12 | false | Enable Binary Authorization requiring signed images.
enable_cloud_armor | 21 | false | Attach a Cloud Armor security policy to the GKE Ingress backend.
admin_ip_ranges | 21 | [] | CIDR ranges permitted for administrative access.
cloud_armor_policy_name | 21 | 'default-waf-policy' | Name of the Cloud Armor security policy to apply.
enable_cdn | 21 | false | Enable Cloud CDN on the load balancer.
enable_vpc_sc | 22 | false | VPC Service Controls perimeter enforcement.
vpc_cidr_ranges | 22 | [] | VPC subnet CIDR ranges for VPC-SC.
vpc_sc_dry_run | 22 | true | Log VPC-SC violations without blocking.
organization_id | 22 | "" | GCP Organization ID for VPC-SC.
enable_audit_logging | 22 | false | Enable detailed Cloud Audit Logs.

5. Storage

Variable | Group | Default | Description
create_cloud_storage | 14 | true | Provision GCS buckets.
storage_buckets | 14 | [] | Additional GCS buckets (none by default).
gcs_volumes | 14 | [] | GCS FUSE volume mounts via CSI driver.
manage_storage_kms_iam | 14 | false | Create CMEK KMS key for GCS encryption.
enable_artifact_registry_cmek | 14 | false | Enable CMEK for Artifact Registry.
max_images_to_retain | 14 | 7 | Maximum container images to keep in Artifact Registry.
delete_untagged_images | 14 | true | Auto-delete untagged images.
image_retention_days | 14 | 30 | Image age-based deletion threshold.

6. CI/CD & Delivery

Variable | Group | Default | Description
enable_cicd_trigger | 12 | false | Enable automated Cloud Build trigger.
github_repository_url | 12 | "" | GitHub repository URL.
github_token | 12 | "" | GitHub PAT. Sensitive.
github_app_installation_id | 12 | "" | GitHub App installation ID.
cicd_trigger_config | 12 | { branch_pattern = "^main$" } | Cloud Build trigger configuration.
enable_cloud_deploy | 12 | false | Google Cloud Deploy managed pipeline.
cloud_deploy_stages | 12 | [dev, staging, prod(approval)] | Cloud Deploy promotion stages.

7. Reliability & Scheduling

A. Health Probes

Variable | Group | Default | Description
startup_probe | 14 | { path="/health", initial_delay_seconds=40, failure_threshold=12, ... } | Startup probe for Crawl4AI. 40 s initial delay for supervisord boot.
liveness_probe | 14 | { path="/health", initial_delay_seconds=60, failure_threshold=3, ... } | Liveness probe.
startup_probe_config | 10 | { enabled=true } | App GKE startup probe config. Takes precedence.
health_check_config | 10 | { enabled=true } | App GKE liveness probe config. Takes precedence.
uptime_check_config | 10 | { enabled=false } | Cloud Monitoring uptime check.
alert_policies | 10 | [] | Cloud Monitoring alert policies.
deployment_timeout | 6 | 1800 | Maximum seconds Terraform waits for the Kubernetes rollout to complete.
enable_pod_disruption_budget | 4 | false | Create a PodDisruptionBudget.
pdb_min_available | 4 | 1 | Minimum pods available during disruptions.
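When a PodDisruptionBudget is enabled, the replica count matters: with a single replica and a minimum of one available pod, Kubernetes cannot evict that pod voluntarily, so node drains stall. A minimal sketch that avoids this by running at least two replicas (module source path assumed, other inputs omitted):

```hcl
module "crawl4ai" {
  source = "./modules/Crawl4AI_GKE" # assumed relative path

  # Two replicas let one pod be evicted while pdb_min_available = 1 still holds.
  min_instance_count = 2
  max_instance_count = 5

  enable_pod_disruption_budget = true
  pdb_min_available            = 1
}
```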

B. Backup & Scheduled Jobs

Variable | Group | Default | Description
backup_schedule | 17 | "" | Not applicable: Crawl4AI is stateless.
backup_retention_days | 17 | 7 | Days to retain backup files.
enable_backup_import | 17 | false | Not applicable for Crawl4AI.
initialization_jobs | 11 | [] | Kubernetes Jobs for initialization tasks.
cron_jobs | 11 | [] | CronJobs to deploy alongside Crawl4AI.

8. Platform-Managed Behaviours

Behaviour | Implementation | Detail
No database provisioned | database_type = "NONE" in Crawl4AI Common | Crawl4AI has no external database dependency.
Embedded Redis | Supervisord starts Redis inside the pod | Task results stored in-memory. Lost on pod restart.
/dev/shm support | emptyDir volume mounted by App GKE | GKE provides proper shared memory for Chromium, so no --disable-dev-shm-usage workaround is needed.
REDIS_TASK_TTL injected | REDIS_TASK_TTL = tostring(var.redis_task_ttl_seconds) | Prevents unbounded Redis memory growth.
PYTHONUNBUFFERED=1 | Injected by Crawl4AI Common | Ensures Python log streaming.
Prebuilt image by default | image_source = "prebuilt" | Uses unclecode/crawl4ai:<version> directly via Artifact Registry mirror.
No auto-generated secrets | secret_ids = {} from Crawl4AI Common | Inject SECRET_KEY via secret_environment_variables to enable JWT auth.
Workload Identity | Managed by App GKE | Pod accesses GCP APIs via Workload Identity, with no service account key files.

9. Outputs

Output | Description
service_name | Name of the Kubernetes Service.
service_url | External URL of the Crawl4AI service.
project_id | GCP project ID.
deployment_id | Deployment ID suffix used in resource names.
container_image | Container image used for the deployment.
cicd_enabled | Whether the CI/CD pipeline is enabled.

Configuration Pitfalls & Sensible Defaults

Risk levels: Critical (data loss, full outage, security breach) — High (service unavailable or significant degradation) — Medium (degraded function or increased cost) — Low (minor impact).

Variable | Sensible Default | Risk | Consequence of Incorrect Value
memory_limit | "8Gi" | Critical | Crawl4AI spawns Chromium browser instances for JavaScript rendering. Each concurrent browser context uses 200–500 MB. Below 4Gi, Chromium processes are OOM-killed mid-crawl, returning partial results. Below 2Gi, the container fails to start. Scale to 16Gi+ for high-concurrency GKE deployments.
cpu_limit | "4000m" | High | Chromium rendering and DOM parsing are CPU-intensive. CPU throttling below 2000m causes internal browser timeouts on complex pages and significantly slows crawl throughput.
/dev/shm for Chromium (GKE emptyDir) | (must be configured via emptyDir medium: Memory in pod spec) | High | On GKE, Chromium by default uses /dev/shm (shared memory) for inter-process communication. The default /dev/shm size in Kubernetes is 64 Mi, which is insufficient for Chromium. Crawl4AI's default config uses --disable-dev-shm-usage (Chrome uses /tmp instead), but if this flag is removed, configure an emptyDir volume with medium: Memory mounted at /dev/shm with adequate size. Insufficient /dev/shm causes browser crashes.
min_instance_count | 1 | High | Crawl4AI has a significant cold start (Chromium + embedded Redis + Supervisord). Scale-to-zero (0) means the first request after a cold start encounters a 30–60 second delay and likely times out. Keep at 1 in production.
max_instance_count | 3 | Medium | Each GKE pod runs its own Chromium pool and embedded Redis. Costs scale with pod count. Set a ceiling matching your crawl concurrency budget.
redis_task_ttl_seconds | 3600 | Medium | Task results in the embedded Redis expire after this TTL. Too-short values (< 300 s) cause results to expire before async clients poll for them. Too-long values cause memory growth. Valid range: 300–86400.
workload_type | null | Medium | Crawl4AI is stateless, so Deployment is appropriate. Using StatefulSet without stateful_pvc_enabled = true wastes scheduler resources. Use Deployment unless local PVC caching is explicitly needed.
quota_memory_requests | "32Gi" | Critical | Must use binary unit suffixes (Gi, Mi). A bare integer (e.g. "32") is treated as 32 bytes by Kubernetes, blocking all pod scheduling. Only active when enable_resource_quota = true.
quota_memory_limits | "64Gi" | Critical | Same constraint as quota_memory_requests: binary suffixes required. If set below the actual pod memory limit × replica count, pods fail to schedule.
LLM_API_KEY (env var via environment_variables) | (not set) | High | LLM-based extraction strategies require a valid provider API key injected as an environment variable. Missing or expired keys cause extraction jobs to return empty extracted_content. Use secret_environment_variables for production to avoid plain-text exposure.
container_port | 11235 | Critical | Crawl4AI listens on port 11235. Changing this without a matching UVICORN_PORT or Kubernetes Service port update causes health probes to fail and the service to receive no traffic.
timeout_seconds | 300 | Medium | Deep crawls of complex pages can exceed 5 minutes. Increase to 600–3600 for workloads involving JavaScript-heavy sites or LLM-based extraction.
enable_iap | false | High | Without IAP, the GKE LoadBalancer is accessible to any caller. Enable IAP or inject CRAWL4AI_API_TOKEN via environment variables for production deployments.
application_version | "latest" | Medium | Using "latest" is non-reproducible. Pin to a specific version tag to prevent unexpected API changes on rebuild.
enable_image_mirroring | true | Low | Crawl4AI images are large. Disable only if Artifact Registry already holds the correct image; otherwise every pod start pulls from Docker Hub and risks rate-limit failures.
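Several of the pitfalls above can be addressed in one module call. The fragment below is a sketch, not a recommended baseline: the module source path is assumed, the version tag is a placeholder to replace with a real pinned tag, and other required inputs are omitted. The quota values are checked against the documented defaults of 8Gi limit and 4Gi request per pod.

```hcl
module "crawl4ai" {
  source = "./modules/Crawl4AI_GKE" # assumed relative path

  # Pin the image tag instead of "latest" (placeholder value).
  application_version = "REPLACE_WITH_PINNED_TAG"

  min_instance_count = 1 # avoid cold starts
  max_instance_count = 3 # cost ceiling

  # Quotas need binary suffixes; a bare "32" would mean 32 bytes.
  # 3 replicas x 4Gi request = 12Gi <= 32Gi
  # 3 replicas x 8Gi limit   = 24Gi <= 64Gi
  enable_resource_quota = true
  quota_memory_requests = "32Gi"
  quota_memory_limits   = "64Gi"

  enable_image_mirroring = true # pull from Artifact Registry, not Docker Hub
}
```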