Skip to main content

Crawl4AI on GKE — Lab Guide

📖 Configuration Guide

Overview

Estimated time: 1–2 hours

Crawl4AI is an open-source LLM-friendly web crawler and scraper. This lab deploys Crawl4AI on GKE Autopilot with supervisord managing embedded Redis (task queue) and Gunicorn (ASGI server) inside the container. Playwright/Chromium handles browser-based crawling with proper /dev/shm shared-memory support via emptyDir volume. No external database is required — Crawl4AI is fully stateless.

What the Module Automates

  • Kubernetes Deployment with Crawl4AI container and emptyDir for /dev/shm
  • Artifact Registry repository and image mirror from Docker Hub
  • Kubernetes namespace, service accounts, and RBAC
  • Workload Identity binding for secure GCP API access
  • Horizontal Pod Autoscaler (HPA) for replica scaling
  • Kubernetes Service (LoadBalancer) for external access
  • Cloud Monitoring uptime checks and alert policies
  • Optional: Secret Manager secrets for LLM API keys

What You Do Manually

  • Configure kubectl to connect to the GKE cluster
  • Confirm the /health endpoint responds
  • Submit crawl jobs via the REST API or the interactive playground
  • Explore extraction strategies (CSS, XPath, LLM-based)
  • Review Kubernetes events and logs

CLI and REST API Overview

ToolPurpose
gcloudAccess GKE cluster credentials, view logs
kubectlInspect Kubernetes resources
curlSubmit crawl jobs to the Crawl4AI API

Install: Google Cloud SDK | kubectl


Prerequisites

  1. A GCP project with billing enabled.
  2. The Services GCP module deployed in the same project (provides VPC and GKE Autopilot cluster).
  3. The following APIs enabled (Services GCP handles this):
    • container.googleapis.com
    • artifactregistry.googleapis.com
    • secretmanager.googleapis.com
  4. gcloud authenticated: gcloud auth application-default login
  5. Access to the RAD UI with permission to deploy modules in the target GCP project.

Phase 1 — Deploy Infrastructure [AUTOMATED]

Step 1.1 — Configure Variables

VariableRequiredDefaultDescription
project_idYesGCP project ID
tenant_deployment_idNo"demo"Short identifier for this deployment
regionNo"us-central1"GCP region for GKE
application_nameNo"crawl4ai"Base name for Kubernetes resources
application_versionNo"latest"Docker image tag
container_resourcesNo{ cpu_limit = "4", memory_limit = "8Gi", cpu_request = "2", mem_request = "4Gi" }CPU and memory limits per pod
min_instance_countNo1Minimum pod replicas
max_instance_countNo5Maximum pod replicas
redis_task_ttl_secondsNo3600TTL for task results in embedded Redis
service_typeNo"LoadBalancer"Kubernetes Service type
secret_environment_variablesNo{}Inject LLM API keys (e.g., { OPENAI_API_KEY = "my-key" })
support_usersNo[]Email addresses for monitoring alerts

Step 1.2 — Initiate Deployment

Approximate deployment durations:

PhaseDuration
GKE namespace and workload provisioning5–8 min
Artifact Registry image mirror3–5 min
Kubernetes pod rollout3–5 min
Total11–18 min

Step 1.3 — Record Outputs and Configure kubectl

export PROJECT="your-gcp-project-id"
export REGION="us-central1"
export TOKEN=$(gcloud auth print-access-token)

# Discover the GKE cluster
export CLUSTER=$(gcloud container clusters list \
--project=${PROJECT} \
--region=${REGION} \
--format="value(name)" \
--limit=1)

# Configure kubectl
gcloud container clusters get-credentials ${CLUSTER} \
--region=${REGION} \
--project=${PROJECT}

# Discover the Crawl4AI namespace
export NAMESPACE=$(kubectl get namespaces \
-o jsonpath='{.items[*].metadata.name}' | tr ' ' '\n' | grep crawl4ai | head -1)
echo "Namespace: ${NAMESPACE}"

# Get the external IP of the Crawl4AI service
export SERVICE_IP=$(kubectl get service -n ${NAMESPACE} \
-o jsonpath='{.items[?(@.spec.type=="LoadBalancer")].status.loadBalancer.ingress[0].ip}')
export SERVICE_URL="http://${SERVICE_IP}"
echo "Crawl4AI URL: ${SERVICE_URL}"

REST API equivalent (list GKE clusters):

curl -H "Authorization: Bearer ${TOKEN}" \
"https://container.googleapis.com/v1/projects/${PROJECT}/locations/${REGION}/clusters" \
| jq '.clusters[] | {name: .name, status: .status}'

Phase 2 — Access the Application [MANUAL]

Step 2.1 — Verify Pod is Running

kubectl get pods -n ${NAMESPACE}

Expected result: At least one pod shows Running status with all containers ready (1/1 or 2/2).

Step 2.2 — Confirm Crawl4AI is Reachable

curl -s "${SERVICE_URL}/health" | jq .

Expected result:

{"status": "healthy"}

If you see a connection refused error, supervisord may still be starting embedded Redis and Gunicorn (allow 40–60 seconds).

Step 2.3 — Inspect Pod Details

kubectl describe pod -n ${NAMESPACE} \
$(kubectl get pods -n ${NAMESPACE} -o name | head -1) | grep -A10 "Containers:"

REST API equivalent:

curl -H "Authorization: Bearer ${TOKEN}" \
"https://container.googleapis.com/v1/projects/${PROJECT}/locations/${REGION}/clusters/${CLUSTER}/nodePools" \
| jq '.nodePools[] | {name: .name, status: .status}'

Expected result: Pod details show the Crawl4AI container with emptyDir volume for /dev/shm, resource limits (cpu: "4", memory: "8Gi"), and Workload Identity service account annotation.

Step 2.4 — Open the Interactive Playground

Navigate to ${SERVICE_URL}/playground in a browser.

Expected result: The Crawl4AI playground UI loads with fields for URL input, crawl options, and a result viewer.


Phase 3 — Submit Crawl Jobs [MANUAL]

Step 3.1 — Synchronous Crawl (Simple)

The /crawl/sync endpoint blocks until the crawl completes and returns the result directly.

curl -X POST "${SERVICE_URL}/crawl/sync" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com"],
"crawler_params": {
"headless": true
}
}' | jq '.result.markdown | length'

Expected result: A positive integer (number of Markdown characters extracted). Typically 2000–5000 characters for a simple HTML page.

Step 3.2 — Asynchronous Crawl (Batch)

For longer crawls, submit asynchronously and poll for completion.

# Submit async crawl job
TASK_ID=$(curl -s -X POST "${SERVICE_URL}/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://en.wikipedia.org/wiki/Machine_learning"],
"crawler_params": {
"headless": true
}
}' | jq -r '.task_id')
echo "Task ID: ${TASK_ID}"

# Poll for completion
sleep 10
curl -s "${SERVICE_URL}/task/${TASK_ID}" \
| jq '{status: .status, markdown_length: (.result.markdown | length)}'

Expected result: Task status transitions from processing to completed with Markdown content extracted from the Wikipedia article. The task ID is stored in the embedded Redis instance for up to redis_task_ttl_seconds seconds.

Step 3.3 — CSS Selector Extraction

Extract structured content using CSS selectors.

curl -X POST "${SERVICE_URL}/crawl/sync" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://news.ycombinator.com"],
"extraction_strategy": {
"type": "css",
"instruction": ".titleline > a"
}
}' | jq '.result.extracted_content[:500]'

Expected result: A list of Hacker News post titles extracted via CSS selector.

Step 3.4 — Multi-URL Batch Crawl

TASK_ID=$(curl -s -X POST "${SERVICE_URL}/crawl" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com",
"https://example.org"
],
"crawler_params": {
"headless": true,
"wait_for": "load"
}
}' | jq -r '.task_id')

echo "Batch Task ID: ${TASK_ID}"
sleep 15
curl -s "${SERVICE_URL}/task/${TASK_ID}" \
| jq '{status: .status, results_count: (.results | length)}'

Expected result: Status transitions to completed with a results array containing one entry per URL.


Phase 4 — Explore Kubernetes Features [MANUAL]

Step 4.1 — View HPA Status

kubectl get hpa -n ${NAMESPACE}

Expected result: HPA shows current replica count and scaling thresholds based on CPU utilisation.

Step 4.2 — View All Deployments

kubectl get deployments -n ${NAMESPACE}

Expected result: One Deployment listed — the Crawl4AI application.

Step 4.3 — View Application Logs

kubectl logs -n ${NAMESPACE} \
$(kubectl get pods -n ${NAMESPACE} -o name | head -1) \
--tail=50

gcloud equivalent:

gcloud logging read \
'resource.type="k8s_container" AND resource.labels.namespace_name="'${NAMESPACE}'"' \
--project=${PROJECT} \
--limit=30 \
--format="table(timestamp, textPayload)"

REST API equivalent:

curl -X POST \
-H "Authorization: Bearer ${TOKEN}" \
-H "Content-Type: application/json" \
"https://logging.googleapis.com/v2/entries:list" \
-d '{
"projectIds": ["'"${PROJECT}"'"],
"filter": "resource.type=\"k8s_container\" AND resource.labels.namespace_name=\"'"${NAMESPACE}"'\"",
"pageSize": 20
}'

Expected result: Supervisord startup logs appear (Redis start, Gunicorn start), followed by Crawl4AI request logs showing crawled URLs, browser session durations, and extraction results.

Step 4.4 — View Kubernetes Events

kubectl get events -n ${NAMESPACE} --sort-by='.lastTimestamp' | tail -20

Expected result: Events show successful pod scheduling and container starts. No OOMKilled or CrashLoopBackOff events under normal operation.

Step 4.5 — Check /dev/shm Mount

kubectl exec -n ${NAMESPACE} \
$(kubectl get pods -n ${NAMESPACE} -o name | head -1) \
-- df -h /dev/shm

Expected result: /dev/shm is mounted as a tmpfs volume (the emptyDir providing shared memory for Chromium).

Step 4.6 — Scale the Deployment

kubectl scale deployment -n ${NAMESPACE} \
$(kubectl get deployment -n ${NAMESPACE} -o name | head -1) \
--replicas=2
kubectl get pods -n ${NAMESPACE} -w

Expected result: A second pod starts. Press Ctrl+C to stop watching. GKE Autopilot provisions additional node capacity automatically. Each additional pod runs its own embedded Redis and Gunicorn stack independently.


Phase 5 — Undeploy [AUTOMATED]

When you are finished, return to the RAD UI, navigate to your deployment, and click Undeploy (or Delete) to remove all resources provisioned by this module.

Approximate undeploy duration: 5–10 minutes.

Note: Crawl4AI is stateless — no database or persistent storage is provisioned by default. Task results in the embedded Redis instance are lost on pod restart. This is expected behaviour.


Summary

ActionPhaseAutomated
GKE namespace and Kubernetes resources1Yes
Artifact Registry image mirror1Yes
Workload Identity binding1Yes
HPA provisioning1Yes
Configure kubectl and get service IP1No
Confirm Crawl4AI is reachable2No
Inspect pod details2No
Open playground UI2No
Submit synchronous crawl jobs3No
Submit asynchronous crawl jobs3No
CSS selector extraction3No
Multi-URL batch crawl3No
Inspect HPA and scaling4No
View deployments4No
Review logs and events4No
Verify /dev/shm mount4No
Manual scaling test4No
Undeploy infrastructure5Yes