Ops — what to alert on, how to fail over, how to recover.
Everything below is what we run against managed instances. Most of it is automatic — the substrate self-heals on common failure modes — but when something needs hands on it, this is the playbook.
Health checks
| endpoint | what it tells you | alert when |
|---|---|---|
| /health | Liveness — process is up and the WAL is mountable. Returns 200 + JSON build/version. Use for ELB target-group health checks. | any 5xx or sustained timeout |
| /ready | Readiness — substrate has applied its log up to the latest sealed segment, lease is held, follower (if any) is connected. | non-200 for >30s after boot |
| /metrics | Prometheus scrape endpoint, ~15s cadence. Exposes per-tenant latency, cache, WAL, replication, and rate-limit counters. | see signal table below |
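For a quick manual probe of the first two endpoints, something like the following works (the hostname is a placeholder; the endpoints and status semantics are as in the table above):

```bash
# Liveness: 200 + build/version JSON when the process is up and the WAL is mountable.
curl -fsS "https://$OC_HOST/health"

# Readiness: non-200 until the log is applied, the lease is held, and any follower
# is connected. Print just the status code.
curl -s -o /dev/null -w '%{http_code}\n' "https://$OC_HOST/ready"
```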
Signals worth paging on
| metric | what it tells you | alert when |
|---|---|---|
| oc_ask_latency_ms (p95) | End-to-end /ask path latency. | p95 > 100 ms for 5 min |
| oc_plan_cache_hit_ratio | Plan cache hits / total. EXPLAIN ANALYZE reads use the cache too. | < 0.9 for 15 min |
| oc_wal_fsync_lag_ms | WAL fsync lag behind durability policy. | > 50 ms for 1 min |
| oc_replication_applied_lag_lsn | Frames produced on writer minus frames applied on follower. | > 1000 for 30s |
| oc_checkpoint_age_s | Seconds since last successful checkpoint. | > 3600 |
| oc_rate_limit_rejections_total | Per-API-key 429s — usually a runaway client loop. | rate > 10/s |
| oc_circuit_open_total | Breaker opened on a downstream (LLM, S3, replication). | any spike > 0 |
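As a rough translation of those thresholds into queries, the same expressions can be run ad hoc against the Prometheus HTTP API. The Prometheus host is a placeholder, and the metric types are assumptions (for example, oc_ask_latency_ms is treated as a histogram below):

```bash
# p95 /ask latency over the table's 5-minute window, assuming the metric is a histogram.
curl -sG "http://$PROM_HOST/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.95, sum by (le) (rate(oc_ask_latency_ms_bucket[5m]))) > 100'

# WAL fsync lag and per-key 429 rate, thresholds straight from the table.
curl -sG "http://$PROM_HOST/api/v1/query" --data-urlencode 'query=oc_wal_fsync_lag_ms > 50'
curl -sG "http://$PROM_HOST/api/v1/query" --data-urlencode \
  'query=rate(oc_rate_limit_rejections_total[5m]) > 10'
```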
Backups
Three layers, all S3-archived in the same region as the tenant.
- Sealed-segment shipping — default. Each WAL segment ships to S3 on rotation. Worst-case data loss is the unflushed tail of the active segment, typically minutes-to-hours on idle tenants.
- Checkpoint shipping — hourly auto-compaction snapshots (checkpoint.snap + a sidecar manifest). Restore picks the latest snapshot at-or-before the target and replays sealed segments from there up to the target.
- Restore-to-timestamp — oc-pitr restore --target <rfc3339> rebuilds a fresh data dir from the chosen snapshot + segments (see the example below). The result opens cleanly through the existing CRC-validating loaders.
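A restore-to-timestamp run, using only the flag documented above (the target instant is illustrative):

```bash
# Rebuild a fresh data dir from the latest checkpoint at-or-before the target,
# then replay sealed segments (and the shipped tail, if tail-shipping is on) up to it.
oc-pitr restore --target 2026-04-30T12:00:00Z
```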
AWS Backup vault keeps recovery points 30 days; manual purges on data-subject requests are documented in the runbook.
Continuous tail-shipping
The default ship-on-seal flow gives PITR granularity at the WAL rotation cadence — fine for most tenants, too coarse for compliance-heavy ones. Tail-shipping uploads the cumulative bytes of the active segment every window ms (default 500), driving worst-case data loss to ~0.5–1.5 s.
On the managed service the tail-shipper runs as a built-in tokio task on oc-server, alongside restore-side replay (auto-replay of the latest tail at-or-before the target).
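For intuition, this is roughly what the built-in shipper does each window, expressed as a shell loop. Paths, the bucket prefix, and the aws CLI are stand-ins; the real implementation is the tokio task inside oc-server.

```bash
SEGMENT=/var/lib/oc/wal/active.seg          # placeholder path to the active WAL segment
TAIL_PREFIX=s3://oc-backups/tenant-123/tail # placeholder bucket/prefix
WINDOW_S=0.5                                # the 500 ms default window

while true; do
  # Upload the cumulative bytes of the active segment; restore later picks the
  # newest tail object at-or-before the requested timestamp and replays it.
  aws s3 cp "$SEGMENT" "$TAIL_PREFIX/$(date -u +%Y%m%dT%H%M%S%3N).seg" --only-show-errors
  sleep "$WINDOW_S"
done
```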
Failover
When the writer is wedged or the underlying EC2 is silent, we promote the
in-region follower via scripts/promote-follower.sh. RTO is ~25s, drilled twice on
live-test-1 as of 2026-04-30.
The flow:
- Fence the old writer — SSM systemctl stop oc-http.service. The old writer's lease-heartbeat would self-fence on the next tick anyway; stopping the service is faster and deterministic.
- Read the writer's ExecStart — strip --mode follower + --leader-addr, preserve sync-rep flags, TLS paths, and LLM config.
- Bump the replication epoch — fences any zombie writer that comes back later.
- Restart the follower in writer mode and smoke-test /health.
- UPSERT Route 53 to the new writer's public IP, TTL 60s (DNS change sketched below).
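The DNS step amounts to a Route 53 UPSERT of the tenant record to the new writer's address (zone ID, record name, and IP below are placeholders):

```bash
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000000000 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "tenant-123.example.com.",
        "Type": "A",
        "TTL": 60,
        "ResourceRecords": [{"Value": "203.0.113.10"}]
      }
    }]
  }'
```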
Manual intervention looks like: page on oc-controlplane-instance-silent, confirm the writer is genuinely dead (not a transient network blip), run the script with --writer-instance + --follower-instance, watch the four step-prints, and verify /health on the new endpoint.
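Put together, a manual promotion looks roughly like this (instance IDs and the hostname are placeholders; the two flags are the ones named above):

```bash
./scripts/promote-follower.sh \
  --writer-instance   i-0123456789abcdef0 \
  --follower-instance i-0fedcba9876543210

# Smoke-test the promoted writer once DNS has flipped.
curl -fsS "https://$OC_HOST/health"
```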
When not to use it: the writer is responding but slow (load-shed, don't fail over); the follower's applied_lsn is far behind (you'll lose data — investigate replication first); during an active online schema migration in the Backfilling state.
Schema migrations
Online migrations are first-class — no read-only window, no service bounce. The model:
- BackfillRateLimiter — capped at 10% of writer throughput so live traffic stays prioritised.
- Dual-read transform — readers see the v1 shape during backfill via an in-memory transform applied to v0 rows.
- Atomic cutover — version bump is a single WAL frame; reads switch to v1-native on the next read.
- Abort-only-pre-cutover — once cutover lands, the only path forward is an inverse-rewrite migration. Aborts in the Backfilling state are safe and reversible.
Use online migrations any time the manifest version bumps. Stuck-in-Backfilling troubleshooting is in incident response → RUNBOOK.
Observability
- EXPLAIN ANALYZE — append ?explain=true to any read endpoint. Returns the plan tree, per-stage row-counts, and elapsed micros. EXPLAIN reads hit the plan cache like normal queries; consult oc_plan_cache_hit_ratio. See the examples after this list.
- /watch SSE — server-sent-events stream of cache-invalidation messages. Plumb it into your app cache to evict stale rows the moment a write commits.
- OTLP push tracing — tail-based sampling. Configure the collector endpoint via the console; we push spans on every /v1 request, with the heavy-tail sampler retaining slow + error traces in full.
- /metrics — Prometheus scrape; see the signal table above.
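Quick examples of the first two. The request shape and auth header are assumptions; only ?explain=true and the /watch path are documented above.

```bash
# EXPLAIN ANALYZE on a read: the same request, plus ?explain=true.
curl -fsS "https://$OC_HOST/v1/ask?explain=true" \
  -H "Authorization: Bearer $OC_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "example"}'

# Tail the cache-invalidation stream over SSE (-N disables curl's buffering).
curl -N "https://$OC_HOST/watch" \
  -H "Accept: text/event-stream" \
  -H "Authorization: Bearer $OC_API_KEY"
```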
Incident response
The full playbook is in docs/RUNBOOK.md. Highlights:
- Pager severity: Sev 1 (tenant down) → 5 min ack, 1 hr fix-or-mitigate. Sev 2 (one alarm tripped) → 15 min / 4 hr. Sev 3 (drift) → next business day.
- Status page: publishes per-region health and incident timelines. Subscribers get email + webhook on any sev 1.
- Stuck-in-Backfilling: abort the migration via POST /v1/tenants/:t/migrations/:id/abort, then resubmit (see the example after this list).
- Order of operations: stop the bleeding, find the root cause, write a postmortem, fix the underlying problem. Quiet incidents left unfixed become loud ones.
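Aborting a stuck migration, with the tenant and migration IDs as placeholders:

```bash
curl -fsS -X POST \
  "https://$OC_HOST/v1/tenants/t_example/migrations/mig_example/abort" \
  -H "Authorization: Bearer $OC_API_KEY"
# then resubmit the migration once the cause is understood
```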
Incidents today are handled by core engineering during extended business hours with best-effort overnight coverage; the pager-severity SLAs above are the targets we hold ourselves to. 24/7 named-engineer coverage is available on Enterprise — contact sales.
Compliance posture
- SOC 2 Type 1: underway with Vanta/Drata and an external CPA. Contact us for the audit timeline; the in-flight gap analysis is available to procurement under NDA.
- HIPAA BAA: available on Enterprise. PHI workloads must run in a region the BAA covers and on a dedicated-capacity instance.
- GDPR DPA: available on Enterprise. EU-region (Frankfurt) instances support the DPA out of the box; data-subject deletion is documented in RUNBOOK.md.