# Operations

- Channel: `latest`
- Source repo: `JaddaHelpifyr/jhf-warp`
## Start / Run / Deploy

Local development:

```bash
./scripts/dev.sh bootstrap
./scripts/dev.sh run
```

See also:

- INSTALL.md
- CONFIGURATION.md

CI verification:

```bash
./scripts/ci.sh
```

OCI image helper:

```bash
./scripts/oci_image.sh print-tags
./scripts/oci_image.sh build
./scripts/oci_image.sh publish
```
## Operator Entry Paths

- Local service path: `./scripts/dev.sh bootstrap`, `./scripts/ci.sh`, `./scripts/dev.sh run`
- Standalone host path: `deploy/compose/jhf-warp.stack.yml`
  - PRODUCTION_STACK_DEPLOYMENT.md (docs/PRODUCTION_STACK_DEPLOYMENT.md)
  - OPERATOR_RUNBOOK.md (docs/OPERATOR_RUNBOOK.md)
- Fabric read-first consumer path: `/health`, `/ready`, `/version`, `/fabric-manifest.json`, `/openapi.json` (see the sketch after this list)
- OCI consumer path: `./scripts/oci_image.sh print-tags`
  - OCI_IMAGE_PATH.md (docs/OCI_IMAGE_PATH.md)
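A minimal read-first consumer pass, assuming the service is reachable at `http://127.0.0.1:8080` (substitute the real host); all five self-description endpoints are read-only:

```bash
# Read-first consumer sketch; the base URL is an assumption, not a repo default.
base=http://127.0.0.1:8080
for p in /health /ready /version /fabric-manifest.json /openapi.json; do
  curl -fsS --max-time 10 "${base}${p}" >/dev/null && echo "ok ${p}"
done
```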
## Healthchecks

- `GET /health` - basic liveness
- `GET /ready` - readiness plus warnings, capability keys, and self-description references
- `GET /version` - canonical version endpoint

Compose healthcheck policy:

- no interval below 20s
- production default: 120s
- optional low-CPU production override: 180s
- integration stack: 20s on Postgres only, then stop the stack after verification

Production container checks:

- api: lightweight TCP socket open on `<internal-runtime-redacted>:8080` (see the sketch after this list)
- postgres: native `pg_isready`
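A minimal sketch of the api-style TCP check, assuming a bash with `/dev/tcp` support and using `127.0.0.1` in place of the redacted runtime host:

```bash
# TCP liveness sketch; host and timeout values are assumptions for illustration.
timeout 5 bash -c 'exec 3<>/dev/tcp/127.0.0.1/8080' && echo "api: port open"
```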
## Deployment Boundary

Minimum operator baseline:

- reverse proxy or gateway in front of the service
- internal-network-only exposure wherever possible
- self-description endpoints handled separately from mutation/control surfaces
- mutating routes default-denied unless an authenticated internal caller explicitly needs them
- configure `JHF_WARP_FABRIC_CONTEXT_BASE_URL` for the current projection/composition layer
- optionally configure `JHF_WARP_FABRIC_CONTEXT_AUTH_TOKEN` for Warp -> Fabric service auth
- verify host-local env files with `python scripts/verify_host_env_contract.py <env-file>` and keep only canonical `JHF_WARP_*` keys in live deployment files
- verify runtime materialization drift with `python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted>` so repo truth, host env, container env, compose labels, and app readback stay aligned (see the sketch after this list)
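A combined verification pass, assuming a hypothetical live env file at `/etc/jhf-warp/jhf-warp.env` and the redacted runtime host exported as `RUNTIME_HOST`:

```bash
# Env-contract plus materialization verification sketch; the env-file path
# and the RUNTIME_HOST variable are assumptions for illustration.
python scripts/verify_host_env_contract.py /etc/jhf-warp/jhf-warp.env
python scripts/verify_runtime_materialization.py --host "${RUNTIME_HOST}"
```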
Read-only self-description surfaces: `/health`, `/ready`, `/version`, `/openapi.json`, `/fabric-manifest.json`
Mutating/control surfaces that should stay on authenticated internal-only paths:

- `/api/v1/openclaw/patch/*`
- `/api/v1/execution/*`
- `/api/v1/control-agent/*`
- persistent learning proposal review/write paths

Projected authority contract:

- self-description endpoints remain open
- internal routes require `Authorization: Bearer <token>` (see the read sketch after this list)
- Heddle stays upstream auth truth
- Fabric currently normalizes/projects context
- future normative governance docks at Spine
- internal write/control endpoints fail closed when the projected authority context is unavailable or incomplete
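A sketch of an authenticated internal read under this contract, assuming the service at `http://127.0.0.1:8080` and `TOKEN` holding a Heddle-issued bearer token:

```bash
# Authenticated internal read sketch; host and TOKEN are assumptions.
curl -fsS -H "Authorization: Bearer ${TOKEN}" \
  http://127.0.0.1:8080/api/v1/control-agent/status
```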
## Safe Docker Log Diagnostics

Use bounded snapshots only for live-host Docker log inspection.

Rules:

- never run unbounded `docker logs` on a live host
- always include `--since`
- always include `--tail`
- always wrap the call in a hard `timeout`
- prefer one bounded snapshot over long-running follow mode

Repo helper:

```bash
./scripts/safe_docker_logs.sh jhf-warp-api 10m 200 20
```

Equivalent raw pattern:

```bash
timeout --foreground 20s docker logs --since 10m --tail 200 jhf-warp-api
```
## Runtime Guardrails

CPU-safe runtime guardrails for the shared host baseline:

- no-repeat, low-pressure diagnostics only
- repo-owned stack truth stays `jhf-warp` with canonical `jhf-warp-*` container names
- default shared-host health and watchdog cadence must stay non-aggressive (>= 60s)
- restart handling must use bounded backoff instead of tight loops (see the sketch after this list)
- every deploy/verify pass must end with a bounded post-deploy cleanup check
- rerunning the same bounded verify flow must stay idempotent and leave no hanging debug helpers
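A bounded-backoff sketch, assuming a compose-managed `api` service in the repo stack file (the service name is an assumption); the loop gives up after five attempts instead of spinning:

```bash
# Bounded-backoff restart sketch; service name "api" is an assumption.
for delay in 5 15 30 60 120; do
  docker compose -f deploy/compose/jhf-warp.stack.yml up -d api && break
  echo "restart failed, backing off ${delay}s" >&2
  sleep "${delay}"
done
```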
Canonical verifiers:

```bash
python scripts/verify_runtime_guardrails.py --report artifacts/runtime-guardrails-report.json
python scripts/verify_runtime_guardrails.py --host <internal-runtime-redacted> --report artifacts/runtime-guardrails-live-report.json
python scripts/verify_runtime_materialization.py
python scripts/verify_runtime_materialization.py --host <internal-runtime-redacted>
python scripts/verify_agent_capability_policy_projection.py
```
Troubleshooting shortcut: take one bounded log snapshot with `./scripts/safe_docker_logs.sh` (see Safe Docker Log Diagnostics above).
Standard bounded diagnostics evidence:

- repo/CI path writes `artifacts/runtime-guardrails-report.json`
- live host path writes `artifacts/runtime-guardrails-live-report.json`
- the smoke workflow uploads the repo report as the canonical bounded diagnostics artifact
Post-deploy cleanup/postcheck expectations:

- run a bounded log snapshot, not a long-lived stream: `./scripts/safe_docker_logs.sh jhf-warp-api 10m 200 20`
- a bounded timeout counts as valid completion for the diagnostic snapshot as long as no lingering log readers remain
- ensure no lingering `docker logs`, `watch`, or `tail -f` processes remain for jhf-warp
- prefer one lightweight `docker stats --no-stream` sample over sustained monitoring (see the sketch after this list)
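A postcheck sketch covering both expectations, assuming `pgrep` is available on the host; the process-matching patterns are best-effort illustration:

```bash
# Post-deploy postcheck sketch: fail on lingering log readers, then take
# one bounded stats sample.
if pgrep -af 'docker logs|tail -f|watch ' | grep -q 'jhf-warp'; then
  echo "lingering log readers found for jhf-warp" >&2
  exit 1
fi
docker stats --no-stream --format '{{.Name}} {{.CPUPerc}} {{.MemUsage}}' \
  | grep 'jhf-warp' || true
```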
## Logging

The service currently relies on standard application logging and CI command output. There is no fully documented structured logging contract yet.

Minimum operator-useful logging should include the following (a bounded scan sketch follows the list):

- startup mode, runtime mode, and persistence mode
- outbound integration skip/fail/success outcomes
- control-agent cycle outcomes
- patch plan/apply guard decisions
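Until a structured contract lands, a bounded grep over a log snapshot is one way to pull these fields; the grep patterns below are illustrative assumptions, not a documented log format:

```bash
# Bounded scan for operator-useful startup/mode lines; patterns are guesses.
./scripts/safe_docker_logs.sh jhf-warp-api 10m 200 20 2>&1 \
  | grep -Ei 'startup|runtime mode|persistence mode'
```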
## Monitoring

Useful operator views today:

- `/health`, `/ready`, `/version`, `/metrics`
- `/api/v1/runtime/inventory`
- `/api/v1/topology/diff`
- `/api/v1/drift/summary`
- `/api/v1/control-agent/status`
- `/api/v1/persistent-agents`

Bounded verify/test stack handling:

- start only for explicit integration verification: `docker compose -f compose.integration.yml up --build -d`
- stop immediately after the check window: `docker compose -f compose.integration.yml down -v` (see the sketch after this list)
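A bounded check-window sketch, assuming the integration stack serves `/ready` on `127.0.0.1:8080` (an assumption); the trap guarantees teardown even if the check fails:

```bash
# Bounded integration window sketch; port and readiness URL are assumptions.
docker compose -f compose.integration.yml up --build -d
trap 'docker compose -f compose.integration.yml down -v' EXIT
timeout --foreground 60s bash -c \
  'until curl -fsS http://127.0.0.1:8080/ready >/dev/null; do sleep 5; done'
```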
Minimum monitoring baseline today:

- self-description state from `/health`, `/ready`, `/version`
- normalized internal metrics from `/metrics`
- persistence mode and runtime mode from `/api/v1/runtime/inventory`
- drift severity from `/api/v1/drift/summary`
- rollout verification from `/api/v1/rollouts/audit`
- control-agent health and scheduler state from `/api/v1/control-agent/status`
- persistent-agent governance state from `/api/v1/persistent-agents`
Metrics surface:

- `GET /metrics` - Prometheus-style text payload (see the sketch after this list)
- protected as an internal read route by the same projected-authority boundary as other internal operator reads
- intentionally small: service, persistence, runtime, drift, outbox, control-agent, and governance counters only
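A one-shot metrics read sketch, assuming the same host and bearer-token conventions as above; the grep filter names are illustrative, not documented counter names:

```bash
# Bounded metrics sample; counter-name filters are assumptions.
curl -fsS --max-time 10 -H "Authorization: Bearer ${TOKEN}" \
  http://127.0.0.1:8080/metrics | grep -E 'drift|outbox|control' | head -20
```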
Minimum alert/warning set that should be surfaced operationally:

- `fixture-memory` active outside explicit development/test use
- OpenClaw runtime unavailable or degraded
- failed or repeatedly skipped downstream delivery
- control-agent reconcile warnings, replay spikes, or watchdog growth
- OCI publish failures on release-oriented builds
Useful dashboard fields:

- Grafana:
  - health/readiness state
  - version and deployment image ref
  - runtime mode and persistence mode
  - drift severity trend
  - integration failure counts
- Gitea:
  - latest green CI state
  - latest built image tag
  - known blocker/warning summary
  - last successful verification timestamp
  - current `/metrics` scrape timestamp or last successful metrics read
Recommended alert thresholds and warnings:

- `fixture-memory` active outside explicit development/test use
- repeated OpenClaw inventory failures
- non-zero drift severity that persists across checks (see the probe sketch after this list)
- repeated control-agent reconcile warnings, replay spikes, or watchdog growth
- repeated downstream integration failures or skipped deliveries
- publish-lane failure when a release-oriented OCI build is expected
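A persistent-drift probe sketch, assuming `/api/v1/drift/summary` returns JSON with a top-level `severity` field (a hypothetical shape) and that `jq` is installed:

```bash
# Drift-severity probe; the "severity" field and "none" value are assumptions.
severity=$(curl -fsS --max-time 10 -H "Authorization: Bearer ${TOKEN}" \
  http://127.0.0.1:8080/api/v1/drift/summary | jq -r '.severity // "unknown"')
[ "${severity}" = "none" ] || echo "warning: drift severity=${severity}"
```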
## Known Failure Modes

- service starts in fixture-memory mode because the Postgres DSN is missing (see the detection sketch after this list)
- runtime inventory/drift degrade because OpenClaw host/runtime facts are unavailable
- outbound integration routes skip because downstream tokens or URLs are missing
- OCI publish job skips because `GITEA_PACKAGES_TOKEN` is not present in the runner context
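A quick detection sketch for the first failure mode, assuming the runtime inventory JSON exposes a persistence-mode field (the field name `persistence_mode` is hypothetical):

```bash
# Fixture-memory detection sketch; the JSON field name is an assumption.
mode=$(curl -fsS --max-time 10 -H "Authorization: Bearer ${TOKEN}" \
  http://127.0.0.1:8080/api/v1/runtime/inventory | jq -r '.persistence_mode // "unknown"')
[ "${mode}" != "fixture-memory" ] || echo "warning: fixture-memory active"
```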
## Restart / Recovery

- restart the service process/container through the deployment system
- verify `/ready` and `/api/v1/runtime/inventory` after restart (see the sketch after this list)
- verify migration state before assuming persistence regressions
- regenerate a patch plan before any live runtime mutation retry
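A post-restart verification sketch, assuming a compose-managed `api` service and the same host/token conventions as above (all three names are assumptions):

```bash
# Post-restart verification sketch; service name, host, and TOKEN are assumptions.
docker compose -f deploy/compose/jhf-warp.stack.yml restart api
curl -fsS --max-time 10 http://127.0.0.1:8080/ready
curl -fsS --max-time 10 -H "Authorization: Bearer ${TOKEN}" \
  http://127.0.0.1:8080/api/v1/runtime/inventory
```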
## Runtime Dependencies

- Python runtime
- Postgres for durable state
- OpenClaw host/runtime for full orchestration value
- optional downstream integrations for setup/sync delivery
## Implemented vs Planned

Implemented today:

- local and stack deployment flows
- projected authority gating for internal read/write routes
- minimal metrics export at `/metrics`
- read-first Fabric self-description surfaces

Planned or external-only:

- Fabric registration or write-back control
- remote MCP server delivery
- operator-managed runner Postgres verification prerequisites
- operator-managed OCI publish credentials and downstream consumer rollout
## Related Documents

- RUNBOOK.md
- SECURITY.md (docs/SECURITY.md)
- OPERATOR_RUNBOOK.md (docs/OPERATOR_RUNBOOK.md)
- OCI_IMAGE_PATH.md (docs/OCI_IMAGE_PATH.md)
## License

AGPLv3. See LICENSE (../LICENSE).

Learn more at helpifyr.com.