Crawl4AI Common Shared Configuration Module

The Crawl4AI Common module defines the Crawl4AI web crawler configuration for the RAD Modules ecosystem. It is a pure configuration module — it creates no GCP resources and produces a config output consumed by platform-specific wrapper modules (Crawl4AI CloudRun and Crawl4AI GKE).

1. Overview

Purpose: To centralise all Crawl4AI-specific configuration (prebuilt container image, embedded Redis + Gunicorn supervisord setup, environment variable mapping, health probes, and the stateless no-database architecture) in a single module shared by both Cloud Run and GKE deployments.

Architecture:

Layer 3: Application Wrappers
├── Crawl4AI_CloudRun ──┐
└── Crawl4AI_GKE ───────┴── instantiate Crawl4AI_Common

Crawl4AI_Common (this module)
    Creates:  (no GCP resources)
    Produces: config, secret_ids, storage_buckets

Layer 2: Platform Modules
├── App_CloudRun (serverless deployment)
└── App_GKE (Kubernetes deployment)

Layer 1: App_Common (networking, storage, secrets, IAM)
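
As a rough illustration of the layering above, a wrapper might instantiate this module and hand its config output to the platform module through application_config. This is a minimal sketch, not the actual wrapper source: the module source paths, the prefix and project values, and the idea that the platform module needs no further inputs are all assumptions. Only the variable names, the config output, and the application_config input come from this page.

```hcl
# Hypothetical wrapper wiring (for example inside Crawl4AI_CloudRun).
# Source paths and literal values are assumptions for illustration only.
locals {
  resource_prefix = "crawl4ai-demo" # assumed value
}

module "crawl4ai_common" {
  source = "../Crawl4AI_Common" # assumed relative path

  project_id     = "my-gcp-project"      # placeholder project ID
  wrapper_prefix = local.resource_prefix # must match the caller's resource_prefix
  deployment_id  = "demo"                # placeholder deployment ID
}

module "app_cloudrun" {
  source = "../App_CloudRun" # assumed relative path

  # The platform module receives the shared configuration object.
  # A real platform module will likely require further inputs.
  application_config = module.crawl4ai_common.config
}
```

A GKE wrapper would presumably follow the same shape, swapping the platform module for App_GKE.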

Key characteristics:

  • No external database: database_type = "NONE". Cloud SQL is not provisioned.
  • Creates no GCP resources — no secrets, no IAM bindings, no storage buckets.
  • Embedded Redis — supervisord starts Redis (priority 10) then Gunicorn (priority 20) inside the same container. Redis listens on localhost:6379 and must NOT be overridden via environment variables.
  • Chromium memory management — the default config.yml includes --disable-dev-shm-usage in Chrome's extra_args. On Cloud Run, Chromium redirects shared memory to /tmp; on GKE, a proper /dev/shm emptyDir volume is mounted by App GKE.
  • secret_ids returns an empty map — no secrets are auto-generated.
  • storage_buckets returns an empty list — no GCS buckets are auto-provisioned.

2. Outputs

config

The application configuration object passed to the platform module via application_config.

| Field | Value / Description |
| --- | --- |
| app_name | var.application_name |
| display_name | var.application_display_name |
| description | var.description |
| container_image | "unclecode/crawl4ai" |
| application_version | var.application_version (default: "latest") |
| image_source | "prebuilt" (uses the upstream Docker Hub image directly) |
| enable_image_mirroring | var.enable_image_mirroring (default: true) |
| container_build_config | enabled = false (no Cloud Build step) |
| container_port | 11235 (Crawl4AI REST API port) |
| database_type | "NONE" (no external database) |
| db_name | "" |
| db_user | "" |
| enable_cloudsql_volume | false (no Cloud SQL sidecar) |
| cloudsql_volume_mount_path | "/cloudsql" (unused) |
| gcs_volumes | var.gcs_volumes |
| enable_postgres_extensions | false |
| postgres_extensions | [] |
| container_resources | See container_resources below |
| min_instance_count | var.min_instance_count |
| max_instance_count | var.max_instance_count |
| startup_probe | var.startup_probe (HTTP /health, 40 s delay) |
| liveness_probe | var.liveness_probe (HTTP /health, 60 s delay) |
| initialization_jobs | var.initialization_jobs |
| additional_services | [] |

container_resources

When container_resources is provided directly, it takes precedence over cpu_limit and memory_limit. The merged object includes:

| Field | Default |
| --- | --- |
| cpu_limit | var.cpu_limit ("4000m" on Cloud Run, "4" on GKE) |
| memory_limit | var.memory_limit ("8Gi") |
| cpu_request | null |
| mem_request | null |
| ephemeral_storage_request | null |
| ephemeral_storage_limit | null |
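
One way to express this precedence in Terraform is a per-field fallback. The sketch below is an assumption about how such a merge could be written, not the module's actual code. The local name merged_resources is hypothetical, and the variable defaults mirror the values documented on this page.

```hcl
variable "cpu_limit" {
  type    = string
  default = "4000m"
}

variable "memory_limit" {
  type    = string
  default = "8Gi"
}

variable "container_resources" {
  type    = any
  default = null
}

locals {
  # try() returns null when container_resources is null or lacks the field.
  # coalesce() then falls back to the scalar variable for the two limits.
  merged_resources = {
    cpu_limit                 = coalesce(try(var.container_resources.cpu_limit, null), var.cpu_limit)
    memory_limit              = coalesce(try(var.container_resources.memory_limit, null), var.memory_limit)
    cpu_request               = try(var.container_resources.cpu_request, null)
    mem_request               = try(var.container_resources.mem_request, null)
    ephemeral_storage_request = try(var.container_resources.ephemeral_storage_request, null)
    ephemeral_storage_limit   = try(var.container_resources.ephemeral_storage_limit, null)
  }
}
```

Under this sketch, passing container_resources = { cpu_limit = "2" } would change only cpu_limit and leave memory_limit at var.memory_limit.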

environment_variables (within config)

| Variable | Value | Description |
| --- | --- | --- |
| PYTHONUNBUFFERED | "1" | Ensures Python log output is not buffered |
| REDIS_TASK_TTL | tostring(var.redis_task_ttl_seconds) | TTL for task results in embedded Redis |

Additional environment variables from var.environment_variables are merged after the above defaults.

Do NOT override REDIS_HOST or REDIS_PORT — these must remain at localhost/6379 to connect to the bundled Redis instance inside the container.
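
The merge order described above can be sketched with Terraform's merge() function, where later arguments win on key conflicts. This is an illustrative assumption about the mechanism, not the module's source, and the local name is hypothetical.

```hcl
variable "redis_task_ttl_seconds" {
  type    = number
  default = 3600
}

variable "environment_variables" {
  type    = map(string)
  default = {}
}

locals {
  # Defaults come first; caller-supplied values are merged after them,
  # so a caller key with the same name replaces the default.
  environment_variables = merge(
    {
      PYTHONUNBUFFERED = "1"
      REDIS_TASK_TTL   = tostring(var.redis_task_ttl_seconds)
    },
    var.environment_variables
  )
}
```

With a merge like this, any REDIS_HOST or REDIS_PORT key supplied by the caller would pass straight through to the container, which is why the warning above matters.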

secret_ids

Empty map — Crawl4AI Common creates no secrets. Use secret_environment_variables in the wrapper module to inject SECRET_KEY and LLM API keys.

storage_buckets

Empty list — no GCS buckets are auto-provisioned.


3. Variables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| project_id | string | | GCP project ID. |
| wrapper_prefix | string | | Prefix for GCS bucket resource naming. Must match the resource_prefix used by the calling module. |
| deployment_id | string | "" | Unique deployment ID. |
| common_labels | map(string) | {} | Labels to apply to resources. |
| region | string | "us-central1" | GCP region for resource deployment. |
| application_name | string | "crawl4ai" | Application name used in resource naming. |
| application_display_name | string | "Crawl4AI Web Crawler" | Human-readable application name. |
| description | string | (Crawl4AI description) | Application description. |
| application_version | string | "latest" | Crawl4AI Docker image tag. |
| redis_task_ttl_seconds | number | 3600 | TTL for task results in embedded Redis. Range: 300–86400. |
| cpu_limit | string | "4000m" | CPU limit for the container. |
| memory_limit | string | "8Gi" | Memory limit for the container. Minimum 4 Gi. |
| container_resources | any | null | Full container resources override. Takes precedence over cpu_limit/memory_limit. |
| min_instance_count | number | 1 | Minimum number of instances. |
| max_instance_count | number | 3 | Maximum number of instances. |
| gcs_volumes | list(any) | [] | Additional GCS volume mounts. |
| environment_variables | map(string) | {} | Additional environment variables. PYTHONUNBUFFERED and REDIS_TASK_TTL are set automatically. Do NOT override REDIS_HOST or REDIS_PORT. |
| initialization_jobs | list(any) | [] | Custom initialisation jobs. |
| startup_probe | object | (HTTP /health, 40 s delay) | Startup probe configuration. |
| liveness_probe | object | (HTTP /health, 60 s delay) | Liveness probe configuration. |
| enable_image_mirroring | bool | true | Mirror the Crawl4AI image to Artifact Registry. |
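
Two of the constraints in the table (the TTL range and the reserved Redis keys) lend themselves to Terraform validation blocks. Whether the module declares such checks is not stated here, so treat the following as a hedged sketch of how a caller or fork could enforce them. The error messages are illustrative.

```hcl
variable "redis_task_ttl_seconds" {
  type        = number
  default     = 3600
  description = "TTL for task results in embedded Redis."

  validation {
    # Documented range: 300 to 86400 seconds.
    condition     = var.redis_task_ttl_seconds >= 300 && var.redis_task_ttl_seconds <= 86400
    error_message = "The redis_task_ttl_seconds value must be between 300 and 86400."
  }
}

variable "environment_variables" {
  type        = map(string)
  default     = {}
  description = "Additional environment variables."

  validation {
    # Reject the two keys that must keep pointing at the bundled Redis.
    condition = (
      !contains(keys(var.environment_variables), "REDIS_HOST") &&
      !contains(keys(var.environment_variables), "REDIS_PORT")
    )
    error_message = "Do not set REDIS_HOST or REDIS_PORT; the embedded Redis listens on localhost:6379."
  }
}
```

Failing at plan time is cheaper than discovering at runtime that the container cannot reach its bundled Redis.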

4. Recognised Environment Variables

The following environment variables are recognised by Crawl4AI at runtime (sourced from server.py, utils.py, and auth.py):

| Variable | Description |
| --- | --- |
| SECRET_KEY | JWT signing secret (default: "mysecret"). Override via secret_environment_variables for production. |
| REDIS_PASSWORD | Redis auth password (default: "", meaning no password for embedded Redis). |
| REDIS_TASK_TTL | TTL in seconds for task data in Redis (default: 3600). Set automatically by this module. |
| LLM_PROVIDER | Override the default LLM provider (e.g., "anthropic/claude-3-haiku"). |
| LLM_API_KEY | Set the LLM API key. Prefer provider-specific keys below. |
| LLM_BASE_URL | Override the LLM API base URL (for proxy or custom endpoints). |
| LLM_TEMPERATURE | Override LLM sampling temperature. |
| OPENAI_API_KEY | OpenAI API key for extraction tasks. |
| ANTHROPIC_API_KEY | Anthropic API key for extraction tasks. |
| DEEPSEEK_API_KEY | DeepSeek API key for extraction tasks. |
| GROQ_API_KEY | Groq API key for extraction tasks. |
| GEMINI_API_KEY | Google Gemini API key for extraction tasks. |
| CRAWL4AI_HOOKS_ENABLED | Enable custom hook execution (default: "false"). Warning: RCE risk. Only enable in fully trusted environments. |

5. Internal Process Architecture

Container startup
└── supervisord (PID 1)
    ├── [priority=10] Redis server → localhost:6379
    │       (task queue, result store)
    └── [priority=20] Gunicorn → 0.0.0.0:11235
        └── 1 worker × 4 threads
            └── FastAPI (crawl4ai.server)
                ├── POST /crawl       (async crawl job)
                ├── GET /task/{id}    (task status & result)
                ├── POST /crawl/sync  (synchronous crawl)
                ├── GET /health       (health check)
                └── GET /playground   (interactive UI)

Chromium is launched on-demand per crawl request. The default config.yml sets crawler.pool.max_pages = 40 (maximum concurrent browser pages per container instance).