Operations

Runtime Model

Primary production-like runtime today:

  • Docker Compose via docker-compose.v2.yml
    • backend service
    • frontend service
    • PostgreSQL

Primary known host:

  • <internal-runtime-redacted>

Compose Services

  • openclaw-pm-v2-postgres
  • openclaw-pm-v2-backend
  • openclaw-pm-v2-frontend

Live container names on the primary host:

  • jhf-pattern-openclaw-pm-v2-postgres-1
  • jhf-pattern-openclaw-pm-v2-backend-1
  • jhf-pattern-openclaw-pm-v2-frontend-1

Start / Run / Deploy

Local

  • backend:
    • uvicorn backend.app.main:app --reload
  • frontend:
    • npm run dev

Host-style deploy

  • use the repo checkout plus docker-compose.v2.yml
  • rebuild the changed service or full stack as needed

Common commands:

docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build
docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build openclaw-pm-v2-backend
docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.backend-healthcheck.yml up -d --build openclaw-pm-v2-backend
docker compose -p jhf-pattern -f docker-compose.v2.yml ps

Low-CPU host mode

Use the override below on weaker external hosts where reducing healthcheck churn matters more than fast local feedback:

docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.low-cpu.yml up -d --build

Optional backend Docker healthcheck

Keep the backend healthcheck disabled by default when external /readyz verification is enough. If operators explicitly need Docker-internal backend health state, enable the lightweight probe with:

docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.backend-healthcheck.yml up -d --build openclaw-pm-v2-backend

Do not keep separate verify/test containers running permanently beside the runtime stack. Run verification commands explicitly, record the result, and stop temporary test processes afterwards.

Health and Readiness

Available runtime surfaces:

  • /healthz
  • /readyz
  • /versionz
  • /contractz
  • /fabric-presencez
  • /fabric-compatibilityz
  • /fabric-manifest.json

These are the current minimum contract surfaces for health, readiness, version, and fabric discovery.
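
A quick reachability sweep over these surfaces can be scripted; the sketch below is a minimal example assuming the backend answers on a locally reachable base URL (BASE is a placeholder, not a documented value):

BASE=http://localhost:18081   # placeholder; substitute the deployed backend host and port
for surface in /healthz /readyz /versionz /contractz /fabric-presencez /fabric-compatibilityz /fabric-manifest.json; do
  # curl -f exits non-zero on HTTP >= 400, so failures are flagged per surface
  curl -sf "$BASE$surface" > /dev/null && echo "OK   $surface" || echo "FAIL $surface"
done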

Verification contract classes:

  • liveness:
    • /healthz
    • use for cheap HTTP reachability only
  • readiness:
    • /readyz
    • use as the default rollout, restart, and recovery gate
  • contract verification:
    • /versionz
    • /contractz
    • /fabric-presencez
    • /fabric-compatibilityz
    • /fabric-manifest.json
  • local proof:
    • bash scripts/fabric-selfcheck.sh
    • bash scripts/runtime-port-discovery.sh

Default verify path:

  1. curl or browser-check /healthz
  2. curl or browser-check /readyz
  3. cd frontend && npm run test:smoke:live
  4. when the home shell has changed, also run cd frontend && npm run test:smoke:new-ui:live
  5. inspect /versionz and /fabric-manifest.json when contract drift is suspected
  6. run bash scripts/fabric-selfcheck.sh
  7. run bash scripts/runtime-port-discovery.sh and capture the JSON evidence
  8. run bash scripts/verify_runtime_guardrails.sh when host-side CPU-safety, bounded diagnostics, restart policy, and cleanup discipline are part of the current rollout scope
  9. run bash scripts/verify_runtime_materialization_drift.sh to verify repo-truth vs compose/override vs container-env vs app-readback consistency
  10. use deeper UI or project-specific flows only after these cheap checks are green
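
Condensed into a shell transcript, the cheap early steps look roughly like this (BASE is again a placeholder for the deployed backend URL):

BASE=http://localhost:18081                 # placeholder backend base URL
curl -sf "$BASE/healthz"                    # step 1: liveness
curl -sf "$BASE/readyz"                     # step 2: readiness gate
(cd frontend && npm run test:smoke:live)    # step 3: bounded live smoke
bash scripts/fabric-selfcheck.sh            # step 6: local fabric proof
bash scripts/runtime-port-discovery.sh      # step 7: capture the JSON evidence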

Live verification split:

  • bounded default live smoke:
    • cd frontend && npm run test:smoke:live
    • covers the stable fast path across auth, preview/home, work, settings, repo settings, and readiness evidence
    • canonical target host is https://<internal-runtime-redacted>
  • full live host sweep:
    • cd frontend && npm run test:smoke:live:full
    • broader operator and mutation coverage
    • not the default fast verify command because it is intentionally larger

Temporary fallback when the canonical ingress edge is degraded:

  • cd frontend && PLAYWRIGHT_LIVE_BASE_URL=http://<internal-runtime-redacted>:15173 PLAYWRIGHT_LIVE_API_BASE=api/v1 npm run test:smoke:live
  • use this only to separate app/runtime health from ingress cutover defects

Current Docker healthcheck policy:

  • openclaw-pm-v2-postgres
    • healthcheck: pg_isready -U openclaw -d openclaw_pm_v2 -h <internal-runtime-redacted> -p 5432
    • interval: 90s default, 120s with the low-CPU override
    • start period: 90s default, 120s with the low-CPU override
    • why: critical dependency with a native lightweight probe and reduced steady-state exec churn
  • openclaw-pm-v2-backend
    • healthcheck: /usr/local/bin/backend-ready-probe.sh via docker-compose.backend-healthcheck.yml
    • interval: 90s, opt-in
    • start period: 90s, opt-in
    • why: lightweight native curl probe against /readyz when Docker-internal health state is operationally useful
  • openclaw-pm-v2-frontend
    • healthcheck: none
    • interval: n/a
    • start period: n/a
    • why: static nginx surface; external HTTP verification is cheaper and clearer

Steady-state CPU guidance:

  • on shared or weak hosts, prefer 90s+ healthcheck intervals for repo-owned services
  • use start_period to absorb startup and recovery time instead of keeping 30s steady-state probes
  • keep backend healthchecks opt-in when external /readyz is enough
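
As a sketch of how this guidance lands in compose terms, an opt-in backend healthcheck override could plausibly have the following shape; the timeout and retries values are assumptions, not values stated in this document:

services:
  openclaw-pm-v2-backend:
    healthcheck:
      test: ["CMD", "/usr/local/bin/backend-ready-probe.sh"]   # lightweight probe against /readyz
      interval: 90s       # long steady-state interval to limit exec churn
      start_period: 90s   # absorbs startup and recovery instead of a 30s steady-state probe
      timeout: 5s         # assumed value
      retries: 3          # assumed value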

Host-side validation after rollout should include:

  • docker compose ... ps
  • docker inspect <container> --format '{{json .State.Health}}' for healthchecked services
  • external curl against /readyz and the frontend root
  • short docker events --filter event=exec_create sampling if healthcheck churn is under review
  • bash scripts/runtime-load-snapshot.sh before and after runtime-affecting rollouts when weak-host load is under review
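
For example, one bounded validation pass could look like this (container names from this document; host, port, and the 60-second events window are placeholders):

docker compose -p jhf-pattern -f docker-compose.v2.yml ps
docker inspect jhf-pattern-openclaw-pm-v2-postgres-1 --format '{{json .State.Health}}'
curl -sf http://localhost:18081/readyz                 # placeholder backend host/port
curl -sf http://localhost:15173/ > /dev/null           # placeholder frontend root
timeout 60 docker events --filter event=exec_create    # bounded churn sample; exits after 60s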

Repeatable host snapshot:

bash scripts/runtime-load-snapshot.sh
bash scripts/runtime-load-snapshot.sh 120

The helper records:

  • host CPU busy percentage, load average, and memory summary
  • per-container docker stats --no-stream
  • current Docker health states for the compose project
  • sampled exec_create event churn during the requested window

Logs

Operational diagnosis should include:

  • backend container logs
  • frontend container logs
  • deploy-time build output
  • API route errors surfaced in backend logs

Concrete commands:

bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-backend-1
bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-frontend-1
bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-postgres-1

Rules:

  • do not use unbounded docker logs or docker compose logs in the standard shared-host verify flow
  • use bounded log snapshots only when /readyz, docker compose ps, and the runtime guardrail verify are not enough
  • after host verification, no lingering docker logs or docker events process should remain
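
If scripts/safe_docker_logs.sh is unavailable, a manually bounded snapshot keeps the same discipline (the tail and window sizes are illustrative):

docker logs --tail 200 --since 10m jhf-pattern-openclaw-pm-v2-backend-1
# no -f/--follow: the command terminates on its own, leaving nothing lingering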

Monitoring

Useful operator monitoring today:

  • readiness checks
  • provider connection status
  • Plane sync status
  • audit/recovery views in the product
  • /metrics for sync drift and contract counters

Current gap:

  • no repo-local Grafana integration

Common Failure Modes

  • provider connected but no data visible
    • usually configuration, auth, or upstream-provider drift
  • Plane state differs from Mission Control
    • inspect sync mode, webhook status, and mapping state
  • repo-backed knowledge pages missing
    • inspect the repo connection and the planning/pages/sync-from-repo flow
  • Fabric discovery endpoint mismatch
    • inspect fabric-manifest.json, /contractz, and scripts/fabric-selfcheck.sh
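
For the fabric discovery case specifically, a minimal triage sequence under the same placeholder BASE assumption is:

BASE=http://localhost:18081   # placeholder backend base URL
curl -sf "$BASE/contractz"
curl -sf "$BASE/fabric-manifest.json"
bash scripts/fabric-selfcheck.sh   # the local-proof script from the verification contract classes above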

Restart / Recovery Hints

  • prefer service-specific rebuilds rather than ad-hoc host edits
  • verify runtime state after redeploy with /readyz
  • for sync-related regressions, verify both endpoint behavior and operator UI state

Concrete restart commands:

docker compose -p jhf-pattern -f docker-compose.v2.yml restart openclaw-pm-v2-backend
docker compose -p jhf-pattern -f docker-compose.v2.yml restart openclaw-pm-v2-frontend

Runtime Dependencies

  • PostgreSQL
  • provider APIs where configured
  • Plane API where configured
  • n8n and runtime webhooks where configured

Known Operational Gaps

  • no standardized event bus contract
  • no repo-local dashboard export to Grafana or Gitea dashboard surfaces

Minimal Alert Conditions

  • high
    • /readyz not returning success
    • backend container restarting repeatedly
    • PostgreSQL healthcheck failing
    • sync paths returning repeated 5xx errors
    • /metrics shows drift counters growing for Plane guarded sync or repo-page reconcile failures
  • medium
    • Plane sync drift persists after a reconcile attempt
    • provider connection tests fail
    • repo-backed page sync fails
  • low
    • documentation/manifests drift while runtime stays healthy
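
A rough shell triage for the high-severity conditions, with placeholder host and port values:

curl -sf http://localhost:18081/readyz || echo "ALERT: /readyz failing"                          # placeholder host/port
docker ps --filter name=jhf-pattern-openclaw-pm-v2-backend-1 --format '{{.Names}} {{.Status}}'   # watch for repeated restarts
docker inspect jhf-pattern-openclaw-pm-v2-postgres-1 --format '{{json .State.Health}}'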

Rollout And Rollback Baseline

Before a runtime-config rollout:

  • capture the current deployed commit or image identity
  • record a short host snapshot with bash scripts/runtime-load-snapshot.sh
  • validate the compose file with docker compose ... config
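
Sketched as commands, with the stack checkout as the working directory:

git rev-parse HEAD                                                       # capture the deployed commit identity
bash scripts/runtime-load-snapshot.sh                                    # short host snapshot
docker compose -p jhf-pattern -f docker-compose.v2.yml config --quiet    # validate the compose file without deploying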

For this repo, the normal rollback path is:

  1. redeploy the last known-good repo commit into the stack checkout
  2. run docker compose ... up -d --build
  3. verify /readyz, frontend HTTP, health state, and bash scripts/verify_runtime_guardrails.sh again
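
As a hedged sketch of that path, with <known-good-commit> standing in for the identity captured before rollout:

git checkout <known-good-commit>                                        # step 1: restore the last known-good repo state
docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build    # step 2: rebuild and redeploy
curl -sf http://localhost:18081/readyz                                  # step 3: placeholder host/port
bash scripts/verify_runtime_guardrails.sh                               # step 3: guardrail verify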

Canonical operator runbooks:

  • backup and restore:
    • docs/RECOVERY_RUNBOOK.md (docs/RECOVERY_RUNBOOK.md)
  • deployment promotion:
    • docs/DEPLOYMENT_PROMOTION_RUNBOOK.md (docs/DEPLOYMENT_PROMOTION_RUNBOOK.md)
  • routine upgrades and rollback boundary:
    • docs/UPGRADE_RUNBOOK.md (docs/UPGRADE_RUNBOOK.md)

Prometheus-Ready Metrics Surface

Mission Control now exposes a read-only metrics surface at:

curl -sf http://<internal-runtime-redacted>:18081/metrics

Current operator-critical metric groups:

  • openclaw_mc_projects_total
  • openclaw_mc_integration_connections_total
  • openclaw_mc_integration_connection_last_test_total
  • openclaw_mc_sync_surface_total
  • openclaw_mc_external_mappings_total
  • openclaw_mc_guarded_sync_state_total
  • openclaw_mc_plane_read_reconcile_total
  • openclaw_mc_repo_page_reconcile_total

Interpretation baseline:

  • openclaw_mc_sync_surface_total{sync_state="drift"} above 0
    • planning or page state needs operator review
  • openclaw_mc_guarded_sync_state_total{status="conflict"} above 0
    • Plane guarded sync has unresolved conflicts
  • openclaw_mc_plane_read_reconcile_total{status="error"} above 0
    • Plane auto-read reconcile failed for at least one configured bridge
  • openclaw_mc_repo_page_reconcile_total{status="error"} above 0
    • repo-backed page reconcile failed for at least one configured bridge
  • openclaw_mc_integration_connection_last_test_total{status="reachable"} missing for a configured provider
    • the connection has not been re-verified successfully yet
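
These thresholds can be spot-checked from the shell without a Prometheus server; each grep below prints a counter line when it is present, and the operator reads the value (a value above 0 matches the conditions above):

METRICS=http://<internal-runtime-redacted>:18081/metrics   # host redacted as elsewhere in this document
curl -sf "$METRICS" | grep 'openclaw_mc_sync_surface_total{sync_state="drift"}'
curl -sf "$METRICS" | grep 'openclaw_mc_guarded_sync_state_total{status="conflict"}'
curl -sf "$METRICS" | grep -E 'openclaw_mc_(plane_read|repo_page)_reconcile_total\{status="error"\}'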

Adopt-First Rollout Boundary

For a future standalone -> integrated adoption, treat resources as follows:

  • adopt:
    • repo checkout
    • product docs and manifest contracts
  • reconfigure:
    • compose/runtime env wiring
    • OAuth and workspace/provider credentials
  • migrate:
    • database-backed Mission Control state
  • rebuild:
    • host-local webhook targets and environment-specific runtime endpoints

Do not assume that a healthy standalone host can be attached to a shared-service environment without explicit operator review of credentials, webhook targets, and database migration scope.

