Operations

Runtime Model

Primary production-like runtime today:

  • Docker Compose via docker-compose.v2.yml
    • backend service
    • frontend service
    • PostgreSQL

Primary known host:

  • <internal-runtime-redacted>

Compose Services

  • openclaw-pm-v2-postgres
  • openclaw-pm-v2-backend
  • openclaw-pm-v2-frontend

Live container names on the primary host:

  • jhf-pattern-openclaw-pm-v2-postgres-1
  • jhf-pattern-openclaw-pm-v2-backend-1
  • jhf-pattern-openclaw-pm-v2-frontend-1

Start / Run / Deploy

Local

  • backend:
    • uvicorn backend.app.main:app --reload
  • frontend:
    • npm run dev

Host-style deploy

  • use the repo checkout plus docker-compose.v2.yml
  • rebuild the changed service or full stack as needed

Common commands:

docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build
docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build openclaw-pm-v2-backend
docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.backend-healthcheck.yml up -d --build openclaw-pm-v2-backend
docker compose -p jhf-pattern -f docker-compose.v2.yml ps

Low-CPU host mode

Use the override below on weaker external hosts where reducing healthcheck churn matters more than fast local feedback:

docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.low-cpu.yml up -d --build

Optional backend Docker healthcheck

Keep the backend healthcheck disabled by default when external /readyz verification is enough. If operators explicitly need Docker-internal backend health state, enable the lightweight probe with:

docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.backend-healthcheck.yml up -d --build openclaw-pm-v2-backend

Do not keep separate verify/test containers running permanently beside the runtime stack. Run verification commands explicitly, record the result, and stop temporary test processes afterwards.

Health and Readiness

Available runtime surfaces:

  • /healthz
  • /readyz
  • /versionz
  • /contractz
  • /fabric-presencez
  • /fabric-compatibilityz
  • /fabric-manifest.json

These are the current minimum contract surfaces for health, readiness, version, and fabric discovery.
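
A quick reachability sweep over these surfaces can be scripted; the sketch below is a minimal example assuming the backend answers on a locally reachable base URL (BASE is a placeholder, not a documented value):

BASE=http://localhost:18081   # placeholder; substitute the deployed backend host and port
for surface in /healthz /readyz /versionz /contractz /fabric-presencez /fabric-compatibilityz /fabric-manifest.json; do
  # curl -f exits non-zero on HTTP >= 400, so failures are flagged per surface
  curl -sf "$BASE$surface" > /dev/null && echo "OK   $surface" || echo "FAIL $surface"
done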

Verification contract classes:

  • liveness:
    • /healthz
    • use for cheap HTTP reachability only
  • readiness:
    • /readyz
    • use as the default rollout, restart, and recovery gate
  • contract verification:
    • /versionz
    • /contractz
    • /fabric-presencez
    • /fabric-compatibilityz
    • /fabric-manifest.json
  • local proof:
    • bash scripts/fabric-selfcheck.sh
    • bash scripts/runtime-port-discovery.sh

Default verify path:

  1. curl or browser-check /healthz
  2. curl or browser-check /readyz
  3. cd frontend && npm run test:smoke:live
  4. when the home shell has changed, also run cd frontend && npm run test:smoke:new-ui:live
  5. inspect /versionz and /fabric-manifest.json when contract drift is suspected
  6. run bash scripts/fabric-selfcheck.sh
  7. run bash scripts/runtime-port-discovery.sh and capture the JSON evidence
  8. run bash scripts/verify_runtime_guardrails.sh when host-side CPU-safety, bounded diagnostics, restart policy, and cleanup discipline are part of the current rollout scope
  9. run bash scripts/verify_runtime_materialization_drift.sh to verify repo-truth vs compose/override vs container-env vs app-readback consistency
  10. use deeper UI or project-specific flows only after these cheap checks are green
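
Condensed into a shell transcript, the cheap early steps look roughly like this (BASE is again a placeholder for the deployed backend URL):

BASE=http://localhost:18081                 # placeholder backend base URL
curl -sf "$BASE/healthz"                    # step 1: liveness
curl -sf "$BASE/readyz"                     # step 2: readiness gate
(cd frontend && npm run test:smoke:live)    # step 3: bounded live smoke
bash scripts/fabric-selfcheck.sh            # step 6: local fabric proof
bash scripts/runtime-port-discovery.sh      # step 7: capture the JSON evidence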

Live verification split:

  • bounded default live smoke:
    • cd frontend && npm run test:smoke:live
    • covers the stable fast path across auth, preview/home, work, settings, repo settings, and readiness evidence
    • canonical target host is https://<internal-runtime-redacted>
  • full live host sweep:
    • cd frontend && npm run test:smoke:live:full
    • broader operator and mutation coverage
    • not the default fast verify command because it is intentionally larger

Temporary fallback when the canonical ingress edge is degraded:

  • cd frontend && PLAYWRIGHT_LIVE_BASE_URL=http://<internal-runtime-redacted>:15173 PLAYWRIGHT_LIVE_API_BASE=api/v1 npm run test:smoke:live
  • use this only to separate app/runtime health from ingress cutover defects

Current Docker healthcheck policy:

  • openclaw-pm-v2-postgres
    • healthcheck: pg_isready -U openclaw -d openclaw_pm_v2 -h <internal-runtime-redacted> -p 5432
    • interval: 90s default, 120s with the low-CPU override
    • start period: 90s default, 120s with the low-CPU override
    • why: critical dependency with a native lightweight probe and reduced steady-state exec churn
  • openclaw-pm-v2-backend
    • healthcheck: /usr/local/bin/backend-ready-probe.sh via docker-compose.backend-healthcheck.yml
    • interval: 90s, opt-in
    • start period: 90s, opt-in
    • why: lightweight native curl probe against /readyz when Docker-internal health state is operationally useful
  • openclaw-pm-v2-frontend
    • healthcheck: none
    • interval: n/a
    • start period: n/a
    • why: static nginx surface; external HTTP verification is cheaper and clearer

Steady-state CPU guidance:

  • on shared or weak hosts, prefer 90s+ healthcheck intervals for repo-owned services
  • use start_period to absorb startup and recovery time instead of keeping 30s steady-state probes
  • keep backend healthchecks opt-in when external /readyz is enough
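
As a sketch of how this guidance lands in compose terms, an opt-in backend healthcheck override could plausibly have the following shape; the timeout and retries values are assumptions, not values stated in this document:

services:
  openclaw-pm-v2-backend:
    healthcheck:
      test: ["CMD", "/usr/local/bin/backend-ready-probe.sh"]   # lightweight probe against /readyz
      interval: 90s       # long steady-state interval to limit exec churn
      start_period: 90s   # absorbs startup and recovery instead of a 30s steady-state probe
      timeout: 5s         # assumed value
      retries: 3          # assumed value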

Host-side validation after rollout should include:

  • docker compose ... ps
  • docker inspect <container> --format '{{json .State.Health}}' for healthchecked services
  • external curl against /readyz and the frontend root
  • short docker events --filter event=exec_create sampling if healthcheck churn is under review
  • bash scripts/runtime-load-snapshot.sh before and after runtime-affecting rollouts when weak-host load is under review
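
For example, one bounded validation pass could look like this (container names from this document; host, port, and the 60-second events window are placeholders):

docker compose -p jhf-pattern -f docker-compose.v2.yml ps
docker inspect jhf-pattern-openclaw-pm-v2-postgres-1 --format '{{json .State.Health}}'
curl -sf http://localhost:18081/readyz                 # placeholder backend host/port
curl -sf http://localhost:15173/ > /dev/null           # placeholder frontend root
timeout 60 docker events --filter event=exec_create    # bounded churn sample; exits after 60s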

Repeatable host snapshot:

bash scripts/runtime-load-snapshot.sh
bash scripts/runtime-load-snapshot.sh 120

The helper records:

  • host CPU busy percentage, load average, and memory summary
  • per-container docker stats --no-stream
  • current Docker health states for the compose project
  • sampled exec_create event churn during the requested window

Logs

Operational diagnosis should include:

  • backend container logs
  • frontend container logs
  • deploy-time build output
  • API route errors surfaced in backend logs

Concrete commands:

bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-backend-1
bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-frontend-1
bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-postgres-1

Rules:

  • do not use unbounded docker logs or docker compose logs in the standard shared-host verify flow
  • use bounded log snapshots only when /readyz, docker compose ps, and the runtime guardrail verify are not enough
  • after host verification, no lingering docker logs or docker events process should remain
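
If scripts/safe_docker_logs.sh is unavailable, a manually bounded snapshot keeps the same discipline (the tail and window sizes are illustrative):

docker logs --tail 200 --since 10m jhf-pattern-openclaw-pm-v2-backend-1
# no -f/--follow: the command terminates on its own, leaving nothing lingering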

Monitoring

Useful operator monitoring today:

  • readiness checks
  • provider connection status
  • Plane sync status
  • audit/recovery views in the product
  • /metrics for sync drift and contract counters

Current gap:

  • no repo-local Grafana integration

Common Failure Modes

  • provider connected but no data visible
    • usually configuration, auth, or upstream-provider drift
  • Plane state differs from Mission Control
    • inspect sync mode, webhook status, and mapping state
  • repo-backed knowledge pages missing
    • inspect the repo connection and the planning/pages/sync-from-repo flow
  • Fabric discovery endpoint mismatch
    • inspect fabric-manifest.json, /contractz, and scripts/fabric-selfcheck.sh
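
For the fabric discovery case specifically, a minimal triage sequence under the same placeholder BASE assumption is:

BASE=http://localhost:18081   # placeholder backend base URL
curl -sf "$BASE/contractz"
curl -sf "$BASE/fabric-manifest.json"
bash scripts/fabric-selfcheck.sh   # the local-proof script from the verification contract classes above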

Restart / Recovery Hints

  • prefer service-specific rebuilds rather than ad-hoc host edits
  • verify runtime state after redeploy with /readyz
  • for sync-related regressions, verify both endpoint behavior and operator UI state

Concrete restart commands:

docker compose -p jhf-pattern -f docker-compose.v2.yml restart openclaw-pm-v2-backend
docker compose -p jhf-pattern -f docker-compose.v2.yml restart openclaw-pm-v2-frontend

Runtime Dependencies

  • PostgreSQL
  • provider APIs where configured
  • Plane API where configured
  • n8n and runtime webhooks where configured

Known Operational Gaps

  • no standardized event bus contract
  • no repo-local dashboard export to Grafana or Gitea dashboard surfaces

Minimal Alert Conditions

  • high
    • /readyz not returning success
    • backend container restarting repeatedly
    • PostgreSQL healthcheck failing
    • sync paths returning repeated 5xx errors
    • /metrics shows drift counters growing for Plane guarded sync or repo-page reconcile failures
  • medium
    • Plane sync drift persists after a reconcile attempt
    • provider connection tests fail
    • repo-backed page sync fails
  • low
    • documentation/manifests drift while runtime stays healthy
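
A rough shell triage for the high-severity conditions, with placeholder host and port values:

curl -sf http://localhost:18081/readyz || echo "ALERT: /readyz failing"                          # placeholder host/port
docker ps --filter name=jhf-pattern-openclaw-pm-v2-backend-1 --format '{{.Names}} {{.Status}}'   # watch for repeated restarts
docker inspect jhf-pattern-openclaw-pm-v2-postgres-1 --format '{{json .State.Health}}'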

Rollout And Rollback Baseline

Before a runtime-config rollout:

  • capture the current deployed commit or image identity
  • record a short host snapshot with bash scripts/runtime-load-snapshot.sh
  • validate the compose file with docker compose ... config
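
Sketched as commands, with the stack checkout as the working directory:

git rev-parse HEAD                                                       # capture the deployed commit identity
bash scripts/runtime-load-snapshot.sh                                    # short host snapshot
docker compose -p jhf-pattern -f docker-compose.v2.yml config --quiet    # validate the compose file without deploying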

For this repo, the normal rollback path is:

  1. redeploy the last known-good repo commit into the stack checkout
  2. run docker compose ... up -d --build
  3. verify /readyz, frontend HTTP, health state, and bash scripts/verify_runtime_guardrails.sh again
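
As a hedged sketch of that path, with <known-good-commit> standing in for the identity captured before rollout:

git checkout <known-good-commit>                                        # step 1: restore the last known-good repo state
docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build    # step 2: rebuild and redeploy
curl -sf http://localhost:18081/readyz                                  # step 3: placeholder host/port
bash scripts/verify_runtime_guardrails.sh                               # step 3: guardrail verify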

Canonical operator runbooks:

  • backup and restore:
    • docs/RECOVERY_RUNBOOK.md (docs/RECOVERY_RUNBOOK.md)
  • deployment promotion:
    • docs/DEPLOYMENT_PROMOTION_RUNBOOK.md (docs/DEPLOYMENT_PROMOTION_RUNBOOK.md)
  • routine upgrades and rollback boundary:
    • docs/UPGRADE_RUNBOOK.md (docs/UPGRADE_RUNBOOK.md)

Prometheus-Ready Metrics Surface

Mission Control now exposes a read-only metrics surface at:

curl -sf http://<internal-runtime-redacted>:18081/metrics

Current operator-critical metric groups:

  • openclaw_mc_projects_total
  • openclaw_mc_integration_connections_total
  • openclaw_mc_integration_connection_last_test_total
  • openclaw_mc_sync_surface_total
  • openclaw_mc_external_mappings_total
  • openclaw_mc_guarded_sync_state_total
  • openclaw_mc_plane_read_reconcile_total
  • openclaw_mc_repo_page_reconcile_total

Interpretation baseline:

  • openclaw_mc_sync_surface_total{sync_state="drift"} above 0
    • planning or page state needs operator review
  • openclaw_mc_guarded_sync_state_total{status="conflict"} above 0
    • Plane guarded sync has unresolved conflicts
  • openclaw_mc_plane_read_reconcile_total{status="error"} above 0
    • Plane auto-read reconcile failed for at least one configured bridge
  • openclaw_mc_repo_page_reconcile_total{status="error"} above 0
    • repo-backed page reconcile failed for at least one configured bridge
  • openclaw_mc_integration_connection_last_test_total{status="reachable"} missing for a configured provider
    • the connection has not been re-verified successfully yet
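
These thresholds can be spot-checked from the shell without a Prometheus server; each grep below prints a counter line when it is present, and the operator reads the value (a value above 0 matches the conditions above):

METRICS=http://<internal-runtime-redacted>:18081/metrics   # host redacted as elsewhere in this document
curl -sf "$METRICS" | grep 'openclaw_mc_sync_surface_total{sync_state="drift"}'
curl -sf "$METRICS" | grep 'openclaw_mc_guarded_sync_state_total{status="conflict"}'
curl -sf "$METRICS" | grep -E 'openclaw_mc_(plane_read|repo_page)_reconcile_total\{status="error"\}'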

Adopt-First Rollout Boundary

For a future standalone -> integrated adoption, treat resources as follows:

  • adopt:
    • repo checkout
    • product docs and manifest contracts
  • reconfigure:
    • compose/runtime env wiring
    • OAuth and workspace/provider credentials
  • migrate:
    • database-backed Mission Control state
  • rebuild:
    • host-local webhook targets and environment-specific runtime endpoints

Do not assume that a healthy standalone host can be attached to a shared-service environment without explicit operator review of credentials, webhook targets, and database migration scope.

