# Operations
## Runtime Model

Primary production-like runtime today: Docker Compose via `docker-compose.v2.yml`, running:

- backend service
- frontend service
- PostgreSQL

Primary known host: `<internal-runtime-redacted>`
## Compose Services

- `openclaw-pm-v2-postgres`
- `openclaw-pm-v2-backend`
- `openclaw-pm-v2-frontend`

Active live container names on the primary host:

- `jhf-pattern-openclaw-pm-v2-postgres-1`
- `jhf-pattern-openclaw-pm-v2-backend-1`
- `jhf-pattern-openclaw-pm-v2-frontend-1`
## Start / Run / Deploy

### Local

- backend: `uvicorn backend.app.main:app --reload`
- frontend: `npm run dev`

### Host-style deploy

- use the repo checkout plus `docker-compose.v2.yml`
- rebuild the changed service or full stack as needed

Common commands:

- `docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build`
- `docker compose -p jhf-pattern -f docker-compose.v2.yml up -d --build openclaw-pm-v2-backend`
- `docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.backend-healthcheck.yml up -d --build openclaw-pm-v2-backend`
- `docker compose -p jhf-pattern -f docker-compose.v2.yml ps`
## Low-CPU host mode

Use the override below on weaker external hosts where reducing healthcheck churn matters more than fast local feedback:

- `docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.low-cpu.yml up -d --build`
## Optional backend Docker healthcheck

Keep the backend healthcheck disabled by default when external `/readyz` verification is enough. If operators explicitly need Docker-internal backend health state, enable the lightweight probe with:

- `docker compose -p jhf-pattern -f docker-compose.v2.yml -f docker-compose.backend-healthcheck.yml up -d --build openclaw-pm-v2-backend`

Do not keep separate verify/test containers running permanently beside the runtime stack. Run verification commands explicitly, record the result, and stop temporary test processes afterwards.
## Health and Readiness

Available runtime surfaces:

- `/healthz`
- `/readyz`
- `/versionz`
- `/contractz`
- `/fabric-presencez`
- `/fabric-compatibilityz`
- `/fabric-manifest.json`

These are the current minimum contract surfaces for health, readiness, version, and fabric discovery.

Verify contract classes:

- liveness: `/healthz` - use for cheap HTTP reachability only
- readiness: `/readyz` - use as the default rollout, restart, and recovery gate
- contract verification: `/versionz`, `/contractz`, `/fabric-presencez`, `/fabric-compatibilityz`, `/fabric-manifest.json`
- local proof: `bash scripts/fabric-selfcheck.sh` and `bash scripts/runtime-port-discovery.sh`
Default verify path:

- `curl` or browser-check `/healthz` (a scripted sketch of the cheap external checks follows this list)
- `curl` or browser-check `/readyz`
- `cd frontend && npm run test:smoke:live`
- when the home shell changed, also run `cd frontend && npm run test:smoke:new-ui:live`
- inspect `/versionz` and `/fabric-manifest.json` when contract drift is suspected
- run `bash scripts/fabric-selfcheck.sh`
- run `bash scripts/runtime-port-discovery.sh` and capture the JSON evidence
- run `bash scripts/verify_runtime_guardrails.sh` when host-side CPU-safety, bounded diagnostics, restart policy, and cleanup discipline are part of the current rollout scope
- run `bash scripts/verify_runtime_materialization_drift.sh` to verify repo-truth vs compose/override vs container-env vs app-readback consistency
- use deeper UI or project-specific flows only after these cheap checks are green
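The external checks at the top of this list can be scripted as one bounded pass. This is a minimal sketch, not a repo-owned script; the `BASE_URL` default is a placeholder and the 10-second timeout is an illustrative choice.

```bash
#!/usr/bin/env bash
# Minimal sketch of the cheap external verify path. BASE_URL is a placeholder,
# not a repo-owned value, and assumes HTTPS ingress in front of the backend.
set -euo pipefail

BASE_URL="${BASE_URL:-https://<internal-runtime-redacted>}"

# Liveness: cheap HTTP reachability only.
curl -sf --max-time 10 "${BASE_URL}/healthz" > /dev/null && echo "healthz OK"

# Readiness: the default rollout, restart, and recovery gate.
curl -sf --max-time 10 "${BASE_URL}/readyz" > /dev/null && echo "readyz OK"

# Contract surfaces, captured as evidence when drift is suspected.
curl -sf --max-time 10 "${BASE_URL}/versionz"
curl -sf --max-time 10 "${BASE_URL}/fabric-manifest.json"
```

With `set -e` the script stops at the first failing surface, which keeps a broken readiness gate from being masked by later output.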
Live verification split:

- bounded default live smoke: `cd frontend && npm run test:smoke:live`
  - covers the stable fast path across auth, preview/home, work, settings, repo settings, and readiness evidence
  - canonical target host is `https://<internal-runtime-redacted>`
- full live host sweep: `cd frontend && npm run test:smoke:live:full`
  - broader operator and mutation coverage
  - not the default fast verify command because it is intentionally larger

Temporary fallback when the canonical ingress edge is degraded: `cd frontend && PLAYWRIGHT_LIVE_BASE_URL=http://<internal-runtime-redacted>:15173 PLAYWRIGHT_LIVE_API_BASE=api/v1 npm run test:smoke:live` - use this only to separate app/runtime health from ingress cutover defects.
Current Docker healthcheck policy:

| Service | Docker healthcheck | Interval | Start period | Why |
|---|---|---|---|---|
| `openclaw-pm-v2-postgres` | `pg_isready -U openclaw -d openclaw_pm_v2 -h <internal-runtime-redacted> -p 5432` | 90s default, 120s with low-CPU override | 90s default, 120s low-CPU | critical dependency with a native lightweight probe and reduced steady-state exec churn |
| `openclaw-pm-v2-backend` | `/usr/local/bin/backend-ready-probe.sh` via `docker-compose.backend-healthcheck.yml` | 90s opt-in | 90s opt-in | lightweight native curl probe against `/readyz` when Docker-internal health state is operationally useful |
| `openclaw-pm-v2-frontend` | none | n/a | n/a | static nginx surface; external HTTP verification is cheaper and clearer |
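The repo-owned `backend-ready-probe.sh` is not reproduced in this document; the sketch below only illustrates the shape such a probe would take, assuming it curls the backend's own `/readyz` from inside the container. The `localhost:8000` port is an assumption, not a documented value.

```bash
#!/usr/bin/env bash
# Hypothetical sketch only; the real /usr/local/bin/backend-ready-probe.sh is
# repo-owned and may differ. Assumes the backend listens on localhost:8000
# inside the container (an assumption) and that curl exists in the image.
# Docker healthcheck convention: exit 0 = healthy, non-zero = unhealthy.
curl -sf --max-time 5 "http://localhost:8000/readyz" > /dev/null || exit 1
exit 0
```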
Steady-state CPU guidance:

- on shared or weak hosts, prefer `90s`+ healthcheck intervals for repo-owned services (the effective settings can be read back as sketched after this list)
- use `start_period` to absorb startup and recovery time instead of keeping `30s` steady-state probes
- keep backend healthchecks opt-in when external `/readyz` is enough
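To confirm which interval and start period a running container actually picked up after a compose override, the health configuration can be read back with `docker inspect`; the container name below matches the live names listed earlier.

```bash
# Effective healthcheck configuration (interval, start period, retries)
# as materialized on the running container.
docker inspect jhf-pattern-openclaw-pm-v2-postgres-1 \
  --format '{{json .Config.Healthcheck}}'

# Current health status and recent probe results for the same container.
docker inspect jhf-pattern-openclaw-pm-v2-postgres-1 \
  --format '{{json .State.Health}}'
```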
Host-side validation after rollout should include:

- `docker compose ... ps`
- `docker inspect <container> --format '{{json .State.Health}}'` for healthchecked services
- external `curl` against `/readyz` and the frontend root
- short `docker events --filter event=exec_create` sampling if healthcheck churn is under review
- `bash scripts/runtime-load-snapshot.sh` before and after runtime-affecting rollouts when weak-host load is under review
Repeatable host snapshot:

- `bash scripts/runtime-load-snapshot.sh`
- `bash scripts/runtime-load-snapshot.sh 120`

The helper records:

- host CPU busy percentage, load average, and memory summary
- per-container `docker stats --no-stream`
- current Docker health states for the compose project
- sampled `exec_create` event churn during the requested window (a bounded manual sampling sketch follows this list)
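If healthcheck exec churn needs a quick manual look outside the snapshot helper, the sample can be taken in a bounded way so no `docker events` process lingers. The 2-minute window below is an illustrative choice, not a repo-mandated value.

```bash
# Bounded manual sample of healthcheck exec churn: explicit --since/--until
# timestamps make docker events print the past window and exit, so no
# long-running docker events process remains afterwards.
# (date -d is GNU date, as found on typical Linux hosts.)
docker events \
  --filter event=exec_create \
  --since "$(date -d '2 minutes ago' +%s)" \
  --until "$(date +%s)"
```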
## Logs
Operational diagnosis should include:
- backend container logs
- frontend container logs
- deploy-time build output
- API route errors surfaced in backend logs
Concrete commands (a bounded fallback without the wrapper is sketched after this list):

- `bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-backend-1`
- `bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-frontend-1`
- `bash scripts/safe_docker_logs.sh jhf-pattern-openclaw-pm-v2-postgres-1`
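`safe_docker_logs.sh` is the repo-owned wrapper. If it is unavailable on a host, a bounded snapshot with plain `docker logs` can approximate it; the 200-line tail, 30-minute window, and output path below are illustrative assumptions, not the wrapper's actual defaults.

```bash
# Fallback bounded log snapshot without the wrapper: --tail and --since keep
# the output small, and the command exits instead of following the stream.
docker logs --tail 200 --since 30m jhf-pattern-openclaw-pm-v2-backend-1 2>&1 \
  | tee /tmp/backend-logs-snapshot.txt
```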
Rules:

- do not use unbounded `docker logs` or `docker compose logs` in the standard shared-host verify flow
- use bounded log snapshots only when `/readyz`, `docker compose ps`, and the runtime guardrail verify are not enough
- after host verification, no lingering `docker logs` or `docker events` process should remain
## Monitoring

Useful operator monitoring today:

- readiness checks
- provider connection status
- Plane sync status
- audit/recovery views in the product
- `/metrics` for sync drift and contract counters

Current gap:

- no repo-local Grafana integration
## Common Failure Modes

- provider connected but no data visible
  - usually configuration, auth, or upstream-provider drift
- Plane state differs from Mission Control
  - inspect sync mode, webhook status, and mapping state
- repo-backed knowledge pages missing
  - inspect repo connection and `planning/pages/sync-from-repo`
- Fabric discovery endpoint mismatch
  - inspect `fabric-manifest.json`, `/contractz`, and `scripts/fabric-selfcheck.sh`
## Restart / Recovery Hints

- prefer service-specific rebuilds rather than ad-hoc host edits
- verify runtime state after redeploy with `/readyz` (a bounded wait loop is sketched after the restart commands)
- for sync-related regressions, verify both endpoint behavior and operator UI state

Concrete restart commands:

- `docker compose -p jhf-pattern -f docker-compose.v2.yml restart openclaw-pm-v2-backend`
- `docker compose -p jhf-pattern -f docker-compose.v2.yml restart openclaw-pm-v2-frontend`
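Readiness can take a short while to return after a restart. A bounded wait loop avoids both premature failure reports and open-ended polling; the 60-second budget, 5-second interval, and `BASE_URL` default are illustrative assumptions.

```bash
#!/usr/bin/env bash
# Bounded wait for /readyz after a restart: poll every 5s, give up after ~60s.
BASE_URL="${BASE_URL:-https://<internal-runtime-redacted>}"

for attempt in $(seq 1 12); do
  if curl -sf --max-time 5 "${BASE_URL}/readyz" > /dev/null; then
    echo "ready after ~$((attempt * 5))s"
    exit 0
  fi
  sleep 5
done

echo "not ready within 60s" >&2
exit 1
```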
## Runtime Dependencies
- PostgreSQL
- provider APIs where configured
- Plane API where configured
- n8n and runtime webhooks where configured
## Known Operational Gaps
- no standardized event bus contract
- no repo-local dashboard export to Grafana or Gitea dashboard surfaces
## Minimal Alert Conditions

high:

- `/readyz` not returning success
- backend container restarting repeatedly
- PostgreSQL healthcheck failing
- sync paths returning repeated 5xx errors
- `/metrics` shows drift counters growing for Plane guarded sync or repo-page reconcile failures

medium:

- Plane sync drift persists after a reconcile attempt
- provider connection tests fail
- repo-backed page sync fails

low:

- documentation/manifests drift while runtime stays healthy
## Rollout And Rollback Baseline

Before a runtime-config rollout:

- capture the current deployed commit or image identity (see the capture sketch after this list)
- record a short host snapshot with `bash scripts/runtime-load-snapshot.sh`
- validate the compose file with `docker compose ... config`
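A minimal pre-rollout capture, assuming the stack checkout is a git working copy on the host; the checkout path and output file locations are illustrative, not repo-mandated.

```bash
#!/usr/bin/env bash
# Pre-rollout baseline: record what is currently deployed before changing it.
set -euo pipefail

cd /path/to/jhf-pattern-checkout   # host-specific checkout location (assumption)

# Currently deployed commit identity of the checkout.
git rev-parse HEAD | tee /tmp/pre-rollout-commit.txt

# Image and container state of the running stack.
docker compose -p jhf-pattern -f docker-compose.v2.yml ps --format json \
  > /tmp/pre-rollout-ps.json

# Short host load snapshot and compose validation, as listed above.
bash scripts/runtime-load-snapshot.sh
docker compose -p jhf-pattern -f docker-compose.v2.yml config --quiet
```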
For this repo, the normal rollback path is:

- redeploy the last known-good repo commit into the stack checkout
- run `docker compose ... up -d --build`
- verify `/readyz`, frontend HTTP, health state, and `bash scripts/verify_runtime_guardrails.sh` again
Canonical operator runbooks:

- backup and restore: [docs/RECOVERY_RUNBOOK.md](docs/RECOVERY_RUNBOOK.md)
- deployment promotion: [docs/DEPLOYMENT_PROMOTION_RUNBOOK.md](docs/DEPLOYMENT_PROMOTION_RUNBOOK.md)
- routine upgrades and rollback boundary: [docs/UPGRADE_RUNBOOK.md](docs/UPGRADE_RUNBOOK.md)
## Prometheus-Ready Metrics Surface

Mission Control now exposes a read-only metrics surface at:

- `curl -sf http://<internal-runtime-redacted>:18081/metrics`

Current operator-critical metric groups:

- `openclaw_mc_projects_total`
- `openclaw_mc_integration_connections_total`
- `openclaw_mc_integration_connection_last_test_total`
- `openclaw_mc_sync_surface_total`
- `openclaw_mc_external_mappings_total`
- `openclaw_mc_guarded_sync_state_total`
- `openclaw_mc_plane_read_reconcile_total`
- `openclaw_mc_repo_page_reconcile_total`
Interpretation baseline (a quick ad-hoc check is sketched after this list):

- `openclaw_mc_sync_surface_total{sync_state="drift"}` above 0 - planning or page state needs operator review
- `openclaw_mc_guarded_sync_state_total{status="conflict"}` above 0 - Plane guarded sync has unresolved conflicts
- `openclaw_mc_plane_read_reconcile_total{status="error"}` above 0 - Plane auto-read reconcile failed for at least one configured bridge
- `openclaw_mc_repo_page_reconcile_total{status="error"}` above 0 - repo-backed page reconcile failed for at least one configured bridge
- `openclaw_mc_integration_connection_last_test_total{status="reachable"}` missing for a configured provider - the connection has not been re-verified successfully yet
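Since there is no repo-local Grafana integration yet, a coarse ad-hoc check can grep the metrics surface for the alert-relevant label values. Matched series may still carry a value of 0, so the operator reviews the actual numbers against the thresholds above; the `METRICS_URL` default is a placeholder.

```bash
# Coarse ad-hoc filter for alert-relevant series; a real alerting rule would
# live in Prometheus, which this repo does not ship (noted gap above).
METRICS_URL="${METRICS_URL:-http://<internal-runtime-redacted>:18081/metrics}"

curl -sf --max-time 10 "$METRICS_URL" \
  | grep -E 'sync_state="drift"|status="conflict"|status="error"' \
  || echo "no drift/conflict/error series currently exposed"
```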
## Adopt-First Rollout Boundary

For a future standalone -> integrated adoption, treat resources as follows:

- adopt:
  - repo checkout
  - product docs and manifest contracts
- reconfigure:
  - compose/runtime env wiring
  - OAuth and workspace/provider credentials
- migrate:
  - database-backed Mission Control state
- rebuild:
  - host-local webhook targets and environment-specific runtime endpoints
Do not assume that a healthy standalone host can be attached to a shared-service environment without explicit operator review of credentials, webhook targets, and database migration scope.
License: AGPLv3
Helpifyr: https://helpifyr.com