Ops — what to alert on, how to fail over, how to recover.
Everything below is what we run against managed instances. Most of it is automatic — the substrate self-heals on common failure modes — but when something needs hands on it, this is the playbook.
Health checks
| endpoint | what it tells you | alert when |
|---|---|---|
| /health | Liveness — process is up and the WAL is mountable. Returns 200 + JSON build/version. Use for ELB target-group health checks. | any 5xx or sustained timeout |
| /ready | Readiness — substrate has applied its log up to the latest sealed segment, lease is held, follower (if any) is connected. | non-200 for >30s after boot |
| /metrics | Prometheus scrape endpoint, ~15s cadence. Exposes per-tenant latency, cache, WAL, replication, and rate-limit counters. | see signal table below |
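For a quick manual probe of the first two endpoints, something like the following works (the hostname is a placeholder; the endpoints and status semantics are as in the table above):

```bash
# Liveness: 200 + build/version JSON when the process is up and the WAL is mountable.
curl -fsS "https://$OC_HOST/health"

# Readiness: non-200 until the log is applied, the lease is held, and any follower
# is connected. Print just the status code.
curl -s -o /dev/null -w '%{http_code}\n' "https://$OC_HOST/ready"
```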
Signals worth paging on
| metric | what it tells you | alert when |
|---|---|---|
| oc_ask_latency_ms (p95) | End-to-end /ask path latency. | p95 > 100 ms for 5 min |
| oc_plan_cache_hit_ratio | Plan cache hits / total. EXPLAIN ANALYZE reads use the cache too. | < 0.9 for 15 min |
| oc_wal_fsync_lag_ms | WAL fsync lag behind durability policy. | > 50 ms for 1 min |
| oc_replication_applied_lag_lsn | Frames produced on writer minus frames applied on follower. | > 1000 for 30s |
| oc_checkpoint_age_s | Seconds since last successful checkpoint. | > 3600 |
| oc_rate_limit_rejections_total | Per-API-key 429s — usually a runaway client loop. | rate > 10/s |
| oc_circuit_open_total | Breaker opened on a downstream (LLM, S3, replication). | any spike > 0 |
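As a rough translation of those thresholds into queries, the same expressions can be run ad hoc against the Prometheus HTTP API. The Prometheus host is a placeholder, and the metric types are assumptions (for example, oc_ask_latency_ms is treated as a histogram below):

```bash
# p95 /ask latency over the table's 5-minute window, assuming the metric is a histogram.
curl -sG "http://$PROM_HOST/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, sum by (le) (rate(oc_ask_latency_ms_bucket[5m]))) > 100'

# WAL fsync lag and per-key 429 rate, thresholds straight from the table.
curl -sG "http://$PROM_HOST/api/v1/query" --data-urlencode 'query=oc_wal_fsync_lag_ms > 50'
curl -sG "http://$PROM_HOST/api/v1/query" --data-urlencode \
  'query=rate(oc_rate_limit_rejections_total[5m]) > 10'
```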
Backups
Three layers, all S3-archived in the same region as the tenant.
- Sealed-segment shipping — default. Each WAL segment ships to S3 on rotation. Worst-case data loss is the unflushed tail of the active segment, typically minutes-to-hours on idle tenants.
- Checkpoint shipping — hourly auto-compaction snapshots (checkpoint.snap + a sidecar manifest). Restore picks the latest snapshot at-or-before the target and replays sealed segments from there up to the target.
- Restore-to-timestamp — oc-pitr restore --target <rfc3339> rebuilds a fresh data dir from the chosen snapshot + segments (see the example below). The result opens cleanly through the existing CRC-validating loaders.
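A restore-to-timestamp run, using only the flag documented above (the target instant is illustrative):

```bash
# Rebuild a fresh data dir from the latest checkpoint at-or-before the target,
# then replay sealed segments (and the shipped tail, if tail-shipping is on) up to it.
oc-pitr restore --target 2026-04-30T12:00:00Z
```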
AWS Backup vault keeps recovery points 30 days; manual purges on data-subject requests are documented in the runbook.
Continuous tail-shipping
The default ship-on-seal flow gives PITR granularity at the WAL rotation cadence — fine for most tenants, too coarse for compliance-heavy ones. Tail-shipping uploads the cumulative bytes of the active segment every window ms (default 500), driving worst-case data loss to ~0.5–1.5 s.
On the managed service the tail-shipper runs as a built-in tokio task on oc-server, alongside restore-side replay (auto-replay of the latest tail at-or-before the target).
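For intuition, this is roughly what the built-in shipper does each window, expressed as a shell loop. Paths, the bucket prefix, and the aws CLI are stand-ins; the real implementation is the tokio task inside oc-server.

```bash
SEGMENT=/var/lib/oc/wal/active.seg          # placeholder path to the active WAL segment
TAIL_PREFIX=s3://oc-backups/tenant-123/tail # placeholder bucket/prefix
WINDOW_S=0.5                                # the 500 ms default window

while true; do
  # Upload the cumulative bytes of the active segment; restore later picks the
  # newest tail object at-or-before the requested timestamp and replays it.
  aws s3 cp "$SEGMENT" "$TAIL_PREFIX/$(date -u +%Y%m%dT%H%M%S%3N).seg" --only-show-errors
  sleep "$WINDOW_S"
done
```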
Failover
When the writer is wedged or the underlying EC2 is silent, we promote the
in-region follower via scripts/promote-follower.sh. RTO is ~25s, drilled twice on
live-test-1 as of 2026-04-30.
The flow:
- Fence the old writer — SSM systemctl stop oc-http.service. The old writer's lease-heartbeat would self-fence on the next tick anyway; stopping the service is faster and deterministic.
- Read the writer's ExecStart — strip --mode follower + --leader-addr, preserve sync-rep flags, TLS paths, and LLM config.
- Bump the replication epoch — fences any zombie writer that comes back later.
- Restart the follower in writer mode and smoke-test /health.
- UPSERT Route 53 to the new writer's public IP, TTL 60s (DNS change sketched below).
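The DNS step amounts to a Route 53 UPSERT of the tenant record to the new writer's address (zone ID, record name, and IP below are placeholders):

```bash
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "tenant-123.example.com.",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
```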
Manual intervention looks like: page on oc-controlplane-instance-silent, confirm the writer is genuinely dead (not a transient network blip), run the script with --writer-instance + --follower-instance, watch the four step-prints, and verify /health on the new endpoint.
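Put together, a manual promotion looks roughly like this (instance IDs and the hostname are placeholders; the two flags are the ones named above):

```bash
./scripts/promote-follower.sh \
  --writer-instance   i-0123456789abcdef0 \
  --follower-instance i-0fedcba9876543210

# Smoke-test the promoted writer once DNS has flipped.
curl -fsS "https://$OC_HOST/health"
```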
When not to use it: the writer is responding but slow (load-shed, don't fail over); the follower's applied_lsn is far behind (you'll lose data — investigate replication first); during an active online schema migration in the Backfilling state.
Schema migrations
Online migrations are first-class — no read-only window, no service bounce. The model:
- BackfillRateLimiter — capped at 10% of writer throughput so live traffic stays prioritised.
- Dual-read transform — readers see the v1 shape during backfill via an in-memory transform applied to v0 rows.
- Atomic cutover — version bump is a single WAL frame; reads switch to v1-native on the next read.
- Abort-only-pre-cutover — once cutover lands, the only path forward is an inverse-rewrite migration. Aborts in the Backfilling state are safe and reversible.
Use online migrations any time the manifest version bumps. Stuck-in-Backfilling troubleshooting is in incident response → RUNBOOK.
Observability
- EXPLAIN ANALYZE — append ?explain=true to any read endpoint. Returns the plan tree, per-stage row-counts, and elapsed micros. EXPLAIN reads hit the plan cache like normal queries; consult oc_plan_cache_hit_ratio. See the examples after this list.
- /watch SSE — server-sent-events stream of cache-invalidation messages. Plumb it into your app cache to evict stale rows the moment a write commits.
- OTLP push tracing — tail-based sampling. Configure the collector endpoint via the console; we push spans on every /v1 request, with the heavy-tail sampler retaining slow + error traces in full.
- /metrics — Prometheus scrape; see the signal table above.
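Quick examples of the first two. The request shape and auth header are assumptions; only ?explain=true and the /watch path are documented above.

```bash
# EXPLAIN ANALYZE on a read: the same request, plus ?explain=true.
curl -fsS "https://$OC_HOST/v1/ask?explain=true" \
  -H "Authorization: Bearer $OC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "example"}'

# Tail the cache-invalidation stream over SSE (-N disables curl's buffering).
curl -N "https://$OC_HOST/watch" \
  -H "Accept: text/event-stream" \
  -H "Authorization: Bearer $OC_API_KEY"
```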
Incident response
The full playbook is in docs/RUNBOOK.md. Highlights:
- Pager severity: Sev 1 (tenant down) → 5 min ack, 1 hr fix-or-mitigate. Sev 2 (one alarm tripped) → 15 min / 4 hr. Sev 3 (drift) → next business day.
- Status page: publishes per-region health and incident timelines. Subscribers get email + webhook on any sev 1.
- Stuck-in-Backfilling: abort the migration via POST /v1/tenants/:t/migrations/:id/abort, then resubmit (see the example after this list).
- Order of operations: stop the bleeding, find the root cause, write a postmortem, fix the underlying problem. Quiet incidents left unfixed become loud ones.
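Aborting a stuck migration, with the tenant and migration IDs as placeholders:

```bash
curl -fsS -X POST \
  "https://$OC_HOST/v1/tenants/t_example/migrations/mig_example/abort" \
  -H "Authorization: Bearer $OC_API_KEY"
# then resubmit the migration once the cause is understood
```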
Incidents today are handled by core engineering during extended business hours with best-effort overnight coverage; the pager-severity SLAs above are the targets we hold ourselves to. 24/7 named-engineer coverage is available on Enterprise — contact sales.
Compliance posture
- SOC 2 Type 1: underway with Vanta/Drata and an external CPA. Contact us for the audit timeline; the in-flight gap analysis is available to procurement under NDA.
- HIPAA BAA: available on Enterprise. PHI workloads must run in a region the BAA covers and on a dedicated-capacity instance.
- GDPR DPA: available on Enterprise. EU-region (Frankfurt) instances support the DPA out of the box; data-subject deletion is documented in RUNBOOK.md.