Crawl4AI on Google Cloud Run

This document provides a comprehensive reference for the modules/Crawl4AI_CloudRun Terraform module. It covers architecture, IAM, configuration variables, Crawl4AI-specific behaviours, and operational patterns for deploying Crawl4AI on Google Cloud Run (v2).


1. Module Overview

Crawl4AI is an open-source LLM-friendly web crawler and scraper with 40,000+ GitHub stars. It enables AI teams to rapidly ingest web content for RAG pipelines, knowledge bases, and monitoring without building custom extraction infrastructure. Crawl4AI CloudRun is a wrapper module built on top of App CloudRun. It uses App CloudRun for all GCP infrastructure provisioning and injects Crawl4AI-specific application configuration via Crawl4AI Common.

Key Capabilities:

  • Compute: Cloud Run v2 (Gen2 required), Python container, 4 vCPU / 8 Gi default. Supervisord manages two processes inside the container: embedded Redis (task queue, port 6379) and Gunicorn ASGI server (port 11235). Chromium/Playwright handles browser-based crawling.
  • Data Persistence: Stateless — no external database is provisioned (database_type = "NONE"). Redis runs inside the container and does not persist between restarts.
  • Security: Inherits Cloud Armor WAF, IAP, Binary Authorization, and VPC Service Controls from App CloudRun. No application secrets are auto-generated by Crawl4AI Common.
  • AI Integration: Supports LLM-based extraction via OpenAI, Anthropic, DeepSeek, Groq, Gemini, and custom providers. API keys are injected via secret_environment_variables.
  • CI/CD: Uses the prebuilt unclecode/crawl4ai image by default; Cloud Build custom image pipeline optional.

Project & Application Identity

| Variable | Group | Type | Default | Description |
|---|---|---|---|---|
| project_id | 1 | string |  | GCP project ID. Required. |
| tenant_deployment_id | 2 | string | 'demo' | Short suffix appended to all resource names. |
| support_users | 2 | list(string) | [] | Email recipients for monitoring alerts. |
| resource_labels | 2 | map(string) | {} | Labels applied to all provisioned resources. |
| application_name | 3 | string | 'crawl4ai' | Base resource name. |
| application_display_name | 3 | string | 'Crawl4AI Web Crawler' | Human-readable name shown in the GCP Console. |
| description | 3 | string | (Crawl4AI description) | Cloud Run service description. |
| application_version | 3 | string | 'latest' | Docker image tag. Use a pinned tag (e.g., '0.6.0') for production. |

Wrapper architecture: Crawl4AI CloudRun calls Crawl4AI Common to build a config object, then passes it as application_modules.crawl4ai to App CloudRun. No additional services are deployed alongside the main container.

Stateless design: Crawl4AI has no external database dependency. The embedded Redis instance stores task results in-memory with a configurable TTL (redis_task_ttl_seconds). Task results are lost on container restart — this is the expected behaviour for an ephemeral crawl API.
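
As an illustration of the wrapper in use, here is a minimal sketch of a root-module call. It assumes the module is vendored at the relative path ./modules/Crawl4AI_CloudRun; the module label, project ID, and label values are placeholders. Only variables from the table above are set, so everything else keeps its default.

```hcl
# Hypothetical root-module call. The source path, module label, and
# project ID are placeholders, not values taken from the module itself.
module "crawl4ai" {
  source = "./modules/Crawl4AI_CloudRun"

  project_id           = "my-gcp-project" # required; placeholder value
  tenant_deployment_id = "demo"           # suffix appended to resource names
  application_name     = "crawl4ai"       # base resource name
  application_version  = "0.6.0"          # pinned tag instead of 'latest'

  resource_labels = {
    team = "data-ingest" # illustrative label
  }
}
```

Pinning application_version keeps repeated applies on the same Crawl4AI release, while 'latest' can change between deployments.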


2. IAM & Access Control

Crawl4AI_CloudRun delegates all IAM provisioning to App_CloudRun. The Cloud Run SA, Cloud Build SA, and IAP service agent role sets are identical to those in App_CloudRun §2.

No auto-generated secrets: Crawl4AI Common creates no Secret Manager secrets. The SECRET_KEY for JWT authentication (if enabled) must be provided via secret_environment_variables:

secret_environment_variables = {
  # Reference to a Secret Manager secret (here, its name), not the secret value itself.
  SECRET_KEY = "crawl4ai-jwt-secret"
}

JWT authentication: Security is disabled by default. Enabling JWT authentication takes two steps: provide SECRET_KEY as a secure random value via secret_environment_variables, and set security.jwt_enabled=true in a custom config.yml. Once both are in place, the /token endpoint requires the security.api_token (also set in config.yml) before it issues short-lived JWTs.
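
Because the module creates no secrets itself, the signing key has to exist in Secret Manager before it is referenced. The following sketch shows one way to do that. It assumes the hashicorp/random and hashicorp/google (v5 or later) providers are configured, that the module accepts the secret's name as the reference, and that the project ID and module path are placeholders.

```hcl
# Generate a random signing key. Note: the value is stored in Terraform state.
resource "random_password" "jwt_signing_key" {
  length  = 48
  special = false
}

# Secret container; the name matches the example value used above.
resource "google_secret_manager_secret" "jwt_signing_key" {
  project   = "my-gcp-project" # placeholder
  secret_id = "crawl4ai-jwt-secret"

  replication {
    auto {}
  }
}

# First version of the secret, holding the generated key.
resource "google_secret_manager_secret_version" "jwt_signing_key" {
  secret      = google_secret_manager_secret.jwt_signing_key.id
  secret_data = random_password.jwt_signing_key.result
}

module "crawl4ai" {
  source     = "./modules/Crawl4AI_CloudRun" # placeholder path
  project_id = "my-gcp-project"              # placeholder

  # Assumption: the module expects the secret name as the reference.
  secret_environment_variables = {
    SECRET_KEY = google_secret_manager_secret.jwt_signing_key.secret_id
  }

  # Make sure a secret version exists before the service is deployed.
  depends_on = [google_secret_manager_secret_version.jwt_signing_key]
}
```

This only supplies the key. JWT enforcement still depends on security.jwt_enabled=true in the custom config.yml.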


3. Core Service Configuration

A. Compute (Cloud Run)

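Crawl4AI requires the Gen2 execution environment: supervisord needs a full Linux process tree, and Chromium runs with --disable-dev-shm-usage, which makes it use /tmp instead of /dev/shm for shared memory. Cloud Run's filesystem is held in memory, so files written to /tmp count against the container's memory limit. The effective memory constraint is therefore total container memory.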

Container image: Crawl4AI_Common sets image_source = "prebuilt" and container_image = "unclecode/crawl4ai". Image mirroring is enabled by default to avoid Docker Hub rate limits.

| Variable | Group | Default | Description |
|---|---|---|---|
| deploy_application | 4 | true | Set false for IAM provisioning only without deploying the container. |
| cpu_limit | 4 | '4000m' | CPU per instance. Size to match expected browser concurrency (~0.5–1 vCPU per active context). |
| memory_limit | 4 | '8Gi' | Memory per instance. Minimum 4 Gi; 8 Gi recommended for multiple concurrent crawls. |
| min_instance_count | 4 | 1 | Minimum instances. Set to 1 for a warm Chromium pool. Set to 0 for cost savings at the cost of 30–60 s cold starts. |
| max_instance_count | 4 | 3 | Maximum concurrently running instances. Range: 1–1000. |
| execution_environment | 4 | 'gen2' | Required. Gen2 for supervisord process tree and Chromium /tmp memory. |
| timeout_seconds | 4 | 3600 | Maximum request duration. Set to 3600 (Cloud Run maximum) to allow long batch crawl jobs. |
| container_protocol | 4 | 'http1' | HTTP protocol version. 'http1' or 'h2c'. |
| enable_image_mirroring | 4 | true | Mirrors the Crawl4AI image to Artifact Registry to avoid Docker Hub rate limits. |
| traffic_split | 4 | [] | Percentage-based canary/blue-green traffic allocation. |
| service_annotations | 4 | {} | Advanced Cloud Run annotations. |
| service_labels | 4 | {} | Labels applied to the Cloud Run service. |

Key differences from App CloudRun defaults:

| Variable | App CloudRun | Crawl4AI CloudRun | Reason |
|---|---|---|---|
| container_port | 8080 | 11235 | Crawl4AI REST API listens on 11235. |
| cpu_limit | '1000m' | '4000m' | Each Chromium browser context consumes ~0.5–1 vCPU. |
| memory_limit | '512Mi' | '8Gi' | Chromium requires significant memory; --disable-dev-shm-usage redirects shared memory to /tmp. |
| execution_environment | 'gen2' | 'gen2' (required) | supervisord and Chromium require Gen2. |
| timeout_seconds | 300 | 3600 | Long batch crawls can take up to 30 minutes. |
| vpc_egress_setting | 'PRIVATE_RANGES_ONLY' | 'ALL_TRAFFIC' | The crawler must reach arbitrary public URLs on the internet. |
| database_type | (varies) | 'NONE' | Crawl4AI has no database dependency. |
B. Crawl4AI-Specific Configuration

| Variable | Group | Default | Description |
|---|---|---|---|
| redis_task_ttl_seconds | 19 | 3600 | TTL in seconds for task results in embedded Redis. Prevents unbounded memory growth. Range: 300–86400. |
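
To make the documented range concrete, here is a sketch of how a 300–86400 bound could be expressed as a Terraform variable validation, together with the string conversion used when the value is injected as REDIS_TASK_TTL. This is an illustration, not the module's actual source; the local name crawl4ai_env is hypothetical.

```hcl
# Illustrative declaration; the real module may define this differently.
variable "redis_task_ttl_seconds" {
  type        = number
  default     = 3600
  description = "TTL in seconds for task results in embedded Redis."

  validation {
    condition     = var.redis_task_ttl_seconds >= 300 && var.redis_task_ttl_seconds <= 86400
    error_message = "redis_task_ttl_seconds must be between 300 and 86400."
  }
}

locals {
  # Environment variable values must be strings, hence tostring().
  crawl4ai_env = {
    REDIS_TASK_TTL = tostring(var.redis_task_ttl_seconds)
  }
}
```

With a validation like this, an out-of-range value fails at plan time instead of surfacing later as expired results or growing Redis memory.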

C. Networking

| Variable | Group | Default | Description |
|---|---|---|---|
| ingress_settings | 5 | 'all' | Traffic sources permitted to reach the Cloud Run service. Use 'all' for public API access. |
| vpc_egress_setting | 5 | 'ALL_TRAFFIC' | Important: Use 'ALL_TRAFFIC' so the crawler can reach arbitrary public URLs. |
| enable_iap | 5 | false | Enable IAP for Google identity authentication. |
| iap_authorized_users | 5 | [] | Users/service accounts granted access through IAP. |
| iap_authorized_groups | 5 | [] | Google Groups granted access through IAP. |
| enable_vpc_sc | 5 | false | Enable VPC Service Controls perimeter enforcement. |
| vpc_cidr_ranges | 5 | [] | VPC subnet CIDR ranges for VPC-SC network access level. |
| vpc_sc_dry_run | 5 | true | Log VPC-SC violations without blocking. |
| organization_id | 5 | "" | GCP Organization ID for VPC-SC policy. |
| enable_audit_logging | 5 | false | Enable detailed Cloud Audit Logs. |
| admin_ip_ranges | 5 | [] | CIDR ranges permitted for administrative access. |
| enable_cloud_armor | 5 | false | Enable Cloud Armor WAF fronted by a Global HTTPS Load Balancer. |
| application_domains | 5 | [] | Custom domain names for the Cloud Armor Load Balancer. |
| enable_cdn | 5 | false | Enable Cloud CDN on the Global HTTPS Load Balancer. |

D. Environment Variables & LLM Integration

Crawl4AI supports LLM-based extraction via environment variable configuration:

| Variable | Group | Default | Description |
|---|---|---|---|
| environment_variables | 6 | {} | Additional plain-text environment variables. PYTHONUNBUFFERED and REDIS_TASK_TTL are set automatically. Do NOT set REDIS_HOST or REDIS_PORT, because Redis runs inside the container. Recognised overrides: LLM_PROVIDER, LLM_BASE_URL, LLM_TEMPERATURE, CRAWL4AI_HOOKS_ENABLED. |
| secret_environment_variables | 6 | {} | Secret Manager secret references injected as environment variables. Use for: SECRET_KEY (JWT signing), OPENAI_API_KEY, ANTHROPIC_API_KEY, DEEPSEEK_API_KEY, GROQ_API_KEY, GEMINI_API_KEY, LLM_API_KEY. |
| secret_propagation_delay | 6 | 30 | Seconds to wait after secret creation. Range: 0–300. |
| secret_rotation_period | 6 | '2592000s' | Rotation period for Secret Manager secrets. Default: 30 days. |

Warning: Setting CRAWL4AI_HOOKS_ENABLED=true allows arbitrary Python code execution through user-supplied hooks. Enable it only in a fully trusted environment.
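
The split between plain-text settings and secret API keys can be sketched as follows. The provider string "openai/gpt-4o-mini" and the secret name "openai-api-key" are assumptions for illustration, as are the module path and project ID. The secret is assumed to already exist in Secret Manager.

```hcl
module "crawl4ai" {
  source     = "./modules/Crawl4AI_CloudRun" # placeholder path
  project_id = "my-gcp-project"              # placeholder

  # Non-sensitive overrides go in plain-text environment variables.
  environment_variables = {
    LLM_PROVIDER    = "openai/gpt-4o-mini" # assumed provider/model string
    LLM_TEMPERATURE = "0.2"                # values are passed as strings
  }

  # API keys are referenced from Secret Manager, never set in plain text.
  secret_environment_variables = {
    OPENAI_API_KEY = "openai-api-key" # assumed name of an existing secret
  }

  # CRAWL4AI_HOOKS_ENABLED is deliberately not set here.
}
```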


4. Advanced Security

A. Cloud Armor WAF

| Variable | Group | Default | Description |
|---|---|---|---|
| enable_cloud_armor | 5 | false | Provisions Global HTTPS LB + Cloud Armor WAF. Required for custom domains and DDoS protection. |
| admin_ip_ranges | 5 | [] | CIDR ranges exempted from WAF rules. |
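
A sketch of fronting the service with Cloud Armor and a custom domain, using only variables documented in this reference. The domain, CIDR range (taken from the documentation-reserved 203.0.113.0/24 block), module path, and project ID are placeholders. Restricting ingress to the load balancer is a suggestion added here, not a documented module default.

```hcl
module "crawl4ai" {
  source     = "./modules/Crawl4AI_CloudRun" # placeholder path
  project_id = "my-gcp-project"              # placeholder

  enable_cloud_armor  = true                  # provisions Global HTTPS LB + WAF
  application_domains = ["crawl.example.com"] # placeholder custom domain
  admin_ip_ranges     = ["203.0.113.0/24"]    # placeholder range exempted from WAF rules

  # Suggestion: accept traffic only via the load balancer rather than 'all'.
  ingress_settings = "internal-and-cloud-load-balancing"
}
```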

B. Identity-Aware Proxy (IAP)

When enable_iap = true, Cloud Run's native IAP integration is enabled. All requests require a valid Google identity before reaching Crawl4AI.
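
A sketch of enabling IAP with an explicit allow-list. The principal format is an assumption: the module may expect IAM-style members with a prefix (as shown) or bare email addresses, so check the variable's description before use. All identities, the module path, and the project ID are placeholders.

```hcl
module "crawl4ai" {
  source     = "./modules/Crawl4AI_CloudRun" # placeholder path
  project_id = "my-gcp-project"              # placeholder

  enable_iap = true

  # Assumed IAM member syntax; the module might accept plain emails instead.
  iap_authorized_users = [
    "user:alice@example.com",
    "serviceAccount:crawl-client@my-gcp-project.iam.gserviceaccount.com",
  ]

  iap_authorized_groups = [
    "group:data-team@example.com",
  ]
}
```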

C. Binary Authorization

| Variable | Group | Default | Description |
|---|---|---|---|
| enable_binary_authorization | 8 | false | Enforce Binary Authorization requiring signed container images. |

5. Storage & Filesystem

Crawl4AI is stateless by design. Storage resources are optional:

| Variable | Group | Default | Description |
|---|---|---|---|
| create_cloud_storage | 11 | true | Provision GCS buckets defined in storage_buckets. |
| storage_buckets | 11 | [] | Additional GCS buckets to provision (e.g., for crawl result caching). No buckets by default. |
| enable_nfs | 11 | false | Provision and mount a Cloud Filestore NFS volume. Not needed for standard Crawl4AI deployments. |
| nfs_mount_path | 11 | '/mnt/nfs' | Filesystem path where the NFS volume is mounted. |
| nfs_instance_name | 11 | "" | Name of an existing NFS GCE VM. Leave empty to auto-discover. |
| nfs_instance_base_name | 11 | 'app-nfs' | Base name for the inline NFS GCE VM. |
| gcs_volumes | 11 | [] | GCS buckets to mount as filesystem volumes via GCS Fuse. |
| manage_storage_kms_iam | 11 | false | Create a CMEK KMS key for GCS encryption. |
| enable_artifact_registry_cmek | 11 | false | Enable CMEK encryption for Artifact Registry. |

6. CI/CD & Delivery

| Variable | Group | Default | Description |
|---|---|---|---|
| enable_cicd_trigger | 8 | false | Enable a Cloud Build trigger for automated builds on GitHub pushes. |
| github_repository_url | 8 | "" | Full HTTPS URL of the GitHub repository. |
| github_token | 8 | "" | GitHub PAT for Cloud Build authentication. Sensitive. |
| github_app_installation_id | 8 | "" | Cloud Build GitHub App installation ID. |
| cicd_trigger_config | 8 | { branch_pattern = "^main$" } | Advanced Cloud Build trigger configuration. |
| enable_cloud_deploy | 8 | false | Switch CI/CD to a managed Cloud Deploy pipeline with promotion stages. |
| cloud_deploy_stages | 8 | [dev, staging, prod(approval)] | Ordered promotion stages for the Cloud Deploy pipeline. |

7. Reliability & Scheduling

A. Health Probes

Crawl4AI exposes a /health HTTP endpoint. Supervisord boots Redis (priority 10) then Gunicorn (priority 20) before /health responds — allow at least 40 seconds of initial delay (matches the docker-compose start_period).

| Variable | Group | Default | Description |
|---|---|---|---|
| startup_probe | 14 | { path="/health", initial_delay_seconds=40, period_seconds=10, failure_threshold=12, ... } | Startup probe. 40 s initial delay for supervisord + Playwright/Chromium initialisation. |
| liveness_probe | 14 | { path="/health", initial_delay_seconds=60, period_seconds=30, failure_threshold=3, ... } | Liveness probe. |
| startup_probe_config | 14 | { enabled=true, path="/health" } | Alternative startup probe for App CloudRun. Takes precedence when both are set. |
| health_check_config | 14 | { enabled=true, path="/health" } | Alternative liveness probe for App CloudRun. Takes precedence when both are set. |
| uptime_check_config | 14 | { enabled=true, path="/health" } | Google Cloud Monitoring uptime check configuration. |
| alert_policies | 14 | [] | Cloud Monitoring alert policies. |
| max_images_to_retain | 14 | 7 | Maximum number of container images to keep in Artifact Registry. |
| delete_untagged_images | 14 | true | Automatically delete untagged container images from Artifact Registry. |
| image_retention_days | 14 | 30 | Days after which container images are eligible for deletion. |
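
The startup probe defaults imply a total startup budget that is easy to overlook. The sketch below works through the arithmetic with the documented values; the local and output names are hypothetical, and the figure is approximate because it ignores per-probe timeouts.

```hcl
locals {
  startup_initial_delay_seconds = 40 # documented default
  startup_period_seconds        = 10 # documented default
  startup_failure_threshold     = 12 # documented default

  # Approximate time a new instance has to become healthy:
  # 40 + (10 * 12) = 160 seconds.
  startup_budget_seconds = (
    local.startup_initial_delay_seconds +
    local.startup_period_seconds * local.startup_failure_threshold
  )
}

output "startup_budget_seconds" {
  description = "Approximate seconds before the startup probe gives up."
  value       = local.startup_budget_seconds
}
```

If Chromium initialisation regularly approaches this budget, raising failure_threshold extends it without delaying the first probe.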

B. Initialization & Cron Jobs

| Variable | Group | Default | Description |
|---|---|---|---|
| initialization_jobs | 9 | [] | Cloud Run jobs to execute once during or after deployment. Defaults to CPU 4000m / memory 8Gi to match the service container. |
| cron_jobs | 9 | [] | Recurring scheduled tasks deployed as Cloud Run jobs triggered by Cloud Scheduler. |

8. Platform-Managed Behaviours

| Behaviour | Implementation | Detail |
|---|---|---|
| No database provisioned | database_type = "NONE" in Crawl4AI Common | Crawl4AI has no external database dependency. Cloud SQL is not provisioned. |
| Embedded Redis | Supervisord starts Redis (priority 10) inside the container | Task results stored in-memory. Lost on restart. Do not override REDIS_HOST or REDIS_PORT. |
| Gen2 required | execution_environment = "gen2" | supervisord requires a full Linux process tree; Chromium uses /tmp for shared memory. |
| ALL_TRAFFIC egress | vpc_egress_setting = "ALL_TRAFFIC" | The crawler must reach arbitrary public URLs on the internet. |
| REDIS_TASK_TTL injected | REDIS_TASK_TTL = tostring(var.redis_task_ttl_seconds) | Prevents unbounded Redis memory growth from accumulated task results. |
| PYTHONUNBUFFERED=1 | Injected by Crawl4AI Common | Ensures Python output is not buffered, which is important for log streaming. |
| No auto-generated secrets | secret_ids = {} from Crawl4AI Common | No secrets are created by default. Inject SECRET_KEY via secret_environment_variables to enable JWT auth. |
| Prebuilt image by default | image_source = "prebuilt" | Uses unclecode/crawl4ai:<version> directly. Image mirroring copies it to Artifact Registry. |

9. Variable Reference

| Variable | Group | Default | Description |
|---|---|---|---|
| module_description | 0 | (Crawl4AI platform text) | Platform metadata: module description. |
| module_documentation | 0 | 'https://docs.radmodules.dev/docs/modules/Crawl4AI_CloudRun' | Platform metadata: documentation URL. |
| module_dependency | 0 | ['Services GCP'] | Platform metadata: required modules. |
| module_services | 0 | (GCP service list) | Platform metadata: GCP services consumed. |
| credit_cost | 0 | 50 | Platform metadata: deployment credit cost. |
| require_credit_purchases | 0 | false | Platform metadata: enforces credit balance check. |
| enable_purge | 0 | true | Permits full deletion of module resources on destroy. |
| public_access | 0 | false | Platform catalogue visibility. |
| shared_users | 0 | [] | Users who can access this module regardless of public_access. Enforced by the platform. |
| deployment_id | 0 | "" | Deployment ID suffix. Auto-generated if empty. |
| resource_creator_identity | 0 | (platform SA) | Service account used by Terraform to manage resources. |
| impersonation_service_account | 0 | "" | SA to impersonate for shell script API calls. |
| project_id | 1 |  | GCP project ID. Required. |
| region | 1 | 'us-central1' | GCP region fallback. |
| tenant_deployment_id | 2 | 'demo' | Short suffix appended to all resource names. |
| support_users | 2 | [] | Email addresses for monitoring alerts. |
| resource_labels | 2 | {} | Labels applied to all provisioned resources. |
| application_name | 3 | 'crawl4ai' | Base resource name. |
| application_display_name | 3 | 'Crawl4AI Web Crawler' | Human-readable name. |
| description | 3 | (Crawl4AI description) | Service description. |
| application_version | 3 | 'latest' | Docker image tag. Use a pinned tag for production. |
| redis_task_ttl_seconds | 19 | 3600 | TTL for embedded Redis task results. Range: 300–86400. |
| deploy_application | 4 | true | Set false for IAM-only deployment. |
| cpu_limit | 4 | '4000m' | CPU per instance. |
| memory_limit | 4 | '8Gi' | Memory per instance. Minimum 4 Gi. |
| min_instance_count | 4 | 1 | Minimum instances. |
| max_instance_count | 4 | 3 | Maximum instances. Range: 1–1000. |
| execution_environment | 4 | 'gen2' | Required. Gen2 for supervisord and Chromium. |
| timeout_seconds | 4 | 3600 | Max request duration. Range: 0–3600. |
| container_protocol | 4 | 'http1' | 'http1' or 'h2c'. |
| enable_image_mirroring | 4 | true | Mirrors image into Artifact Registry. |
| traffic_split | 4 | [] | Canary/blue-green traffic allocation. |
| service_annotations | 4 | {} | Advanced Cloud Run annotations. |
| service_labels | 4 | {} | Labels applied to the Cloud Run service. |
| ingress_settings | 5 | 'all' | 'all', 'internal', or 'internal-and-cloud-load-balancing'. |
| vpc_egress_setting | 5 | 'ALL_TRAFFIC' | Use 'ALL_TRAFFIC' for internet crawling. |
| enable_iap | 5 | false | Enables IAP authentication. |
| iap_authorized_users | 5 | [] | IAP-authorized users/SAs. |
| iap_authorized_groups | 5 | [] | IAP-authorized Google Groups. |
| enable_vpc_sc | 5 | false | VPC Service Controls perimeter enforcement. |
| vpc_cidr_ranges | 5 | [] | VPC subnet CIDR ranges for VPC-SC. |
| vpc_sc_dry_run | 5 | true | Log-only mode for VPC-SC. |
| organization_id | 5 | "" | GCP Organization ID for VPC-SC. |
| enable_audit_logging | 5 | false | Enable Cloud Audit Logs. |
| admin_ip_ranges | 5 | [] | Administrative CIDR ranges. |
| enable_cloud_armor | 5 | false | Cloud Armor WAF + Global HTTPS LB. |
| application_domains | 5 | [] | Custom domains for Cloud Armor LB. |
| enable_cdn | 5 | false | Cloud CDN on the HTTPS LB backend. |
| cloudsql_volume_mount_path | 5 | '/cloudsql' | Not used by Crawl4AI but required by App CloudRun interface. |
| environment_variables | 6 | {} | Additional plain-text env vars. Do not set REDIS_HOST/REDIS_PORT. |
| secret_environment_variables | 6 | {} | Secret Manager references. Use for SECRET_KEY, OPENAI_API_KEY, etc. |
| secret_propagation_delay | 6 | 30 | Seconds to wait after secret creation. |
| secret_rotation_period | 6 | '2592000s' | Secret Manager rotation notification frequency. |
| backup_schedule | 7 | "" | Not applicable: Crawl4AI is stateless. |
| backup_retention_days | 7 | 7 | Days to retain backup files. |
| enable_backup_import | 7 | false | Not applicable for Crawl4AI. |
| backup_source | 7 | 'gcs' | Backup source. |
| backup_uri | 7 | "" | Backup file location. |
| backup_format | 7 | 'sql' | Backup file format. |
| enable_cicd_trigger | 8 | false | Cloud Build GitHub trigger. |
| github_repository_url | 8 | "" | GitHub repository URL. |
| github_token | 8 | "" | GitHub PAT. Sensitive. |
| github_app_installation_id | 8 | "" | GitHub App installation ID. |
| cicd_trigger_config | 8 | { branch_pattern = "^main$" } | Cloud Build trigger config. |
| enable_cloud_deploy | 8 | false | Cloud Deploy pipeline. |
| cloud_deploy_stages | 8 | [dev, staging, prod(approval)] | Cloud Deploy promotion stages. |
| enable_binary_authorization | 8 | false | Enforce image attestation. |
| enable_custom_sql_scripts | 9 | false | Not applicable for Crawl4AI. |
| custom_sql_scripts_bucket | 9 | "" | GCS bucket for SQL scripts. |
| custom_sql_scripts_path | 9 | "" | Path prefix for SQL scripts. |
| custom_sql_scripts_use_root | 9 | false | Run SQL scripts as root user. |
| initialization_jobs | 9 | [] | One-shot Cloud Run Jobs. Defaults to 4 vCPU / 8 Gi per job. |
| cron_jobs | 9 | [] | Recurring scheduled Cloud Run Jobs. |
| create_cloud_storage | 11 | true | Provision GCS buckets. |
| storage_buckets | 11 | [] | Additional GCS buckets (none by default for Crawl4AI). |
| enable_nfs | 11 | false | NFS volume mount. Not needed for standard Crawl4AI deployments. |
| nfs_mount_path | 11 | '/mnt/nfs' | NFS mount path. |
| nfs_instance_name | 11 | "" | Existing NFS GCE VM name. |
| nfs_instance_base_name | 11 | 'app-nfs' | Base name for inline NFS VM. |
| gcs_volumes | 11 | [] | GCS Fuse volume mounts. |
| manage_storage_kms_iam | 11 | false | CMEK for GCS. |
| enable_artifact_registry_cmek | 11 | false | CMEK for Artifact Registry. |
| database_type | 12 | 'NONE' | No database for Crawl4AI. |
| database_password_length | 12 | 32 | Not used by Crawl4AI. |
| startup_probe_config | 14 | { enabled=true, path="/health" } | App CloudRun startup probe config. |
| health_check_config | 14 | { enabled=true, path="/health" } | App CloudRun liveness probe config. |
| uptime_check_config | 14 | { enabled=true, path="/health" } | Cloud Monitoring uptime check. |
| alert_policies | 14 | [] | Cloud Monitoring alert policies. |
| startup_probe | 14 | { path="/health", initial_delay_seconds=40, failure_threshold=12, ... } | Startup probe forwarded to Crawl4AI Common. |
| liveness_probe | 14 | { path="/health", initial_delay_seconds=60, failure_threshold=3, ... } | Liveness probe forwarded to Crawl4AI Common. |
| max_images_to_retain | 14 | 7 | Maximum container images in Artifact Registry. |
| delete_untagged_images | 14 | true | Auto-delete untagged images. |
| image_retention_days | 14 | 30 | Image age-based deletion threshold. |

10. Outputs

| Output | Description |
|---|---|
| service_name | Name of the Cloud Run service. |
| service_url | Public URL of the Crawl4AI Cloud Run service. |
| service_location | GCP region where the Cloud Run service is deployed. |
| project_id | GCP project ID. |
| deployment_id | Deployment ID suffix used in resource names. |
| container_image | Container image used for the deployment. |
| cicd_enabled | Whether the CI/CD pipeline is enabled. |
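
These outputs can be re-exported from a root configuration. The sketch below assumes the module was instantiated under the label "crawl4ai", which is a placeholder, and builds a health-check URL from service_url and the /health endpoint described in the reliability section.

```hcl
# Assumes a module block labelled "crawl4ai" exists in the same configuration.
output "crawl4ai_health_url" {
  description = "Health endpoint of the deployed Crawl4AI service."
  value       = "${module.crawl4ai.service_url}/health"
}

output "crawl4ai_image" {
  description = "Container image actually deployed (useful when mirroring is enabled)."
  value       = module.crawl4ai.container_image
}
```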

Configuration Pitfalls & Sensible Defaults

Risk levels: Critical (data loss, full outage, security breach) — High (service unavailable or significant degradation) — Medium (degraded function or increased cost) — Low (minor impact).

| Variable | Sensible Default | Risk | Consequence of Incorrect Value |
|---|---|---|---|
| vpc_egress_setting | "ALL_TRAFFIC" | Critical | Crawl4AI crawls arbitrary public URLs on the internet. Using "PRIVATE_RANGES_ONLY" routes only RFC-1918 traffic through the VPC and blocks all external crawl targets. All crawl jobs to public websites will fail with connection errors. ALL_TRAFFIC is required and is the correct default. |
| memory_limit | "8Gi" | Critical | Crawl4AI spawns Chromium browser instances for JavaScript-rendered pages. Each concurrent browser context uses 200–500 MB. The default config allows up to 40 concurrent browser pages. Below 4Gi, Chromium processes are OOM-killed mid-crawl, returning partial or empty results. Below 2Gi, the container itself fails to start. 8Gi is the recommended minimum for production. |
| cpu_limit | "4000m" | High | Chromium JavaScript rendering and DOM processing are CPU-intensive. Under 2000m, Chromium triggers internal timeouts on complex pages, and crawl times balloon. The default 4000m supports moderate concurrency; scale up for heavy parallel crawls. |
| execution_environment | "gen2" | High | Crawl4AI uses Direct VPC Egress (not a VPC connector). Direct VPC Egress is only available on Gen2. Downgrading to gen1 prevents the service from deploying with VPC network configuration. |
| min_instance_count | 1 | High | Crawl4AI has a significant cold start due to Chromium initialization and the embedded Redis/Supervisord stack. With scale-to-zero (0), the first request after an idle period waits 30–60 seconds for a new instance, which can exceed client-side timeouts. Keep at 1 for responsive crawl APIs. |
| max_instance_count | 3 | Medium | Each additional instance spawns its own Chromium pool and embedded Redis. At high concurrency, costs scale linearly with instance count. Set an explicit limit matching your concurrency budget. |
| timeout_seconds | 3600 | Medium | Deep crawls or LLM-based extraction of large pages can take several minutes. The default 3600 seconds (1 hour) is intentionally high. Reduce for short-lived crawl APIs where zombie requests should be killed faster. |
| redis_task_ttl_seconds | 3600 | Medium | Crawl4AI stores task results in its embedded Redis. The valid range is 300–86400. A TTL below 300 s would let completed task results expire before clients poll for them; a TTL above 86400 s would let accumulated results grow Redis memory without bound. |
| LLM_API_KEY (env var) | (not set) | High | LLM-based extraction strategies (e.g., LLMExtractionStrategy) require a valid API key. Setting an invalid or expired key causes extraction jobs to fail with 401 errors from the LLM provider. Inject via secret_environment_variables; never hardcode the key in plain text. |
| OPENAI_API_KEY / ANTHROPIC_API_KEY (env var) | (not set) | High | When using provider-specific extraction strategies, the corresponding API key must be present. Missing keys cause extraction to fail silently (empty or null extracted_content in results). |
| container_port | 11235 | Critical | Crawl4AI's REST API listens on port 11235. Changing this without a matching UVICORN_PORT environment variable causes health checks to fail, preventing the revision from receiving traffic. |
| enable_iap | false | High | The default ingress_settings = "all" exposes the crawl API publicly. Without IAP or a crawl API token, any caller can submit crawl jobs, consuming cloud resources. Enable IAP or inject CRAWL4AI_API_TOKEN via environment variables. |
| application_version | "latest" | Medium | Using "latest" makes deployments non-reproducible. A rebuild may pull a new Crawl4AI version with breaking API changes. Pin to a specific version tag for production. |
| enable_image_mirroring | true | Low | Crawl4AI images are large (~3–4 GiB compressed). Without mirroring to Artifact Registry, every Cloud Run deployment pulls from Docker Hub, risking rate limit failures and slow cold starts. Keep mirroring enabled. |
| enable_cicd_trigger | false | Low | When enabled, ensure github_token and github_repository_url are correctly set. An invalid token silently prevents Cloud Build triggers from firing. |

Destroying Resources

When destroying a Cloud Run deployment, you may encounter a serverless IPv4 address release error. If the first destroy attempt fails with this error, wait 20–30 minutes and then re-run tofu destroy.