
Core concepts.

OriginChain is a hash-keyed key/value substrate with a Plan tree on top. SQL, vector, full-text, and graph are not separate engines — they are different key shapes and different Plan operators over the same store. Understand the substrate, the keys, and the Plan, and the rest of the surface follows.

01 · the substrate

A single hash-keyed k/v store.

The engine is a single B-tree-free, hash-indexed key/value store fronted by a write-ahead log. Every write is appended to the WAL, fsynced, then applied. Reads go through a process-wide page cache. There is no row-store / column-store / vector-engine split — every domain is a different prefix on the same keyspace.
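
A minimal sketch of that ordering, with a std HashMap standing in for the hash-indexed page store and a plain file for the WAL. The framing and names are illustrative, not the engine's real types.

use std::collections::HashMap;
use std::fs::File;
use std::io::{Result, Write};

struct Engine {
    wal: File,                        // append-only write-ahead log
    store: HashMap<Vec<u8>, Vec<u8>>, // stand-in for the hash-indexed page store
}

impl Engine {
    fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> Result<()> {
        // 1. Append the frame: length-prefixed key and value.
        self.wal.write_all(&(key.len() as u32).to_le_bytes())?;
        self.wal.write_all(&key)?;
        self.wal.write_all(&(value.len() as u32).to_le_bytes())?;
        self.wal.write_all(&value)?;
        // 2. fsync before acknowledging anything.
        self.wal.sync_data()?;
        // 3. Only then apply the write to the store.
        self.store.insert(key, value);
        Ok(())
    }
}

In production the fsync is group-committed across concurrent writers; the sketch syncs per write for clarity.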

Each tenant gets a single, region-isolated EC2 instance. No shared compute, no noisy neighbour. Writes go to one primary; an optional sync follower ships WAL frames in lockstep, giving RPO=0 on paid tiers. The follower bootstraps from a Frame::Snapshot transfer and then tails.

Fig. — request path: client HTTP → Plan tree → Engine (page cache) → WAL (fsync, group commit) → sync follower (frames) and tail-shipper (continuous) → S3 archive (checkpoints + sealed segments). One EC2 instance per tenant, region-isolated, one VPC, one EBS volume.
02 · schemas

Declared in TOML.

A schema manifest declares columns, indexes, relations to other schemas, and extractions (chunked text fields, vector fields, FTS fields). The catalog is itself stored as rows — adding a field is a write, not a downtime migration.

# schemas/orders.toml
name    = "orders"
version = 1

columns = [
  { name = "id",       type = "ulid", pk = true },
  { name = "customer", type = "ulid" },
  { name = "amount",   type = "decimal" },
  { name = "status",   type = "string" },
  { name = "placed",   type = "timestamp" },
]

[[indexes]]
columns = ["status"]
[[indexes]]
columns = ["customer", "placed"]

[[relations]]
edge   = "customer"
target = "customers"     # rel|fwd|orders|customer|<order>|<customer>

[[extractions.fts]]
field    = "notes"
analyzer = "english"     # snowball stem + diacritics fold + stop-words

[[extractions.vector]]
field  = "summary_embedding"
dim    = 1024
metric = "cosine"

Indexes, relations, and extractions are honoured at write time — no separate "build index" step. Online migrations follow a strict contract: monotonic version int, one of four allowed shapes per migration, eager 10% backfill, dual-read transform, atomic cutover. See ops → migrations.
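
The dual-read transform is easiest to see in a sketch: a v1 → v2 migration that adds a column upgrades old rows as they are read, while the eager backfill rewrites the rest in the background. The column, default, and names below are hypothetical.

// Hypothetical v2 of the orders manifest adds a `currency` column with a
// default. Rows written before the cutover still decode as v1; dual-read
// upgrades them on the fly.
struct OrderRow {
    version: u32,             // schema version the row was written under
    currency: Option<String>, // hypothetical column added in v2
}

fn dual_read(row: OrderRow) -> OrderRow {
    match row.version {
        2 => row,                                                   // already the new shape
        1 => OrderRow { version: 2, currency: Some("USD".into()) }, // apply the v2 default
        v => panic!("schema version {v} is newer than this binary"),
    }
}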

03 · key shapes

Ten domain prefixes in production.

Every byte stored on disk lives under one of these prefixes. SQL reads row|* and idx|*. Graph reads rel|*. Full-text reads fts*. Vector reads vec*. The plan cache is intent|*.

row · row|<schema>|<pk_bytes>
The primary user-facing record. PK can be ULID, UUID, string, or composite. Encodes a single row as MessagePack.

idx · idx|<schema>|<column>|<value_bytes>|<pk_bytes>
Secondary index entries. Hash-keyed range scans work via prefix iteration on (schema, column, value).

rel · rel|fwd|<schema>|<edge>|<src_pk>|<dst_pk> · rel|rev|<schema>|<edge>|<dst_pk>|<src_pk>
Edges between rows. Always written in pairs (forward + reverse) so neighbours and reverse-neighbours are both O(prefix-scan).

chunk · chunk|<schema>|<row_pk>|<seq_u32>
Document chunks for FTS / vector — splits long text fields into addressable units while keeping the parent row intact.

fts · fts|<schema>|<field>|<token>|<row_pk>|<positions>
Inverted-index posting list. Token is post-tokeniser (UAX #29 + optional Snowball stem). Positions back phrase queries.

fts_doclen · fts_doclen|<schema>|<field>|<row_pk>
Per-doc length cache for BM25 scoring. Updated atomically with each fts insert.

fts_corpus · fts_corpus|<schema>|<field>
Corpus-wide stats: doc count, total token count, average doc length. One key per (schema, field).

vec · vec|<schema>|<field>|<row_pk>
Raw f32 embedding vector — the value the SIMD distance kernel reads.

vec_idx · vec_idx|<schema>|<field>|<segment_id>
Serialised HNSW graph segment. Loaded once per process into the graph cache, evicted on schema migration.

intent · intent|<question_hash>
Plan cache entry — the compiled Plan tree for a question template. Skips the LLM compile on cache hit.
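
To make the fan-out concrete, here is a sketch of what a single orders write produces under those shapes. Helper names are illustrative, and joining parts with a literal | is shape only; a real encoding has to escape the separator inside values.

fn key(parts: &[&[u8]]) -> Vec<u8> {
    parts.join(&b'|') // shape only; values containing '|' would need escaping
}

// One put of an order row fans out into row, idx, and rel entries.
fn writes_for_order(pk: &[u8], customer: &[u8], status: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)> {
    vec![
        // primary record (MessagePack body elided)
        (key(&[b"row", b"orders", pk]), b"<msgpack>".to_vec()),
        // secondary index entry on status
        (key(&[b"idx", b"orders", b"status", status, pk]), Vec::new()),
        // relation edges, always written as a forward/reverse pair
        (key(&[b"rel", b"fwd", b"orders", b"customer", pk, customer]), Vec::new()),
        (key(&[b"rel", b"rev", b"orders", b"customer", customer, pk]), Vec::new()),
    ]
}

This is the write-time fan-out section 02 describes: the idx and rel entries come from the same put as the row, not from a background index builder.
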
04 · the plan tree

Eleven operators, one tree.

Both /sql and /ask compile to the same Plan tree. The tree is JSON-serialisable, cached by question hash under intent|*, and replayable. Every shipped query shape is one of these operators or a composition.

Scan
Full prefix scan of a row keyspace. The fallback when no index applies.
ColumnScan
Projection-aware scan that decodes only the requested fields from the MessagePack body.
IndexScan
Hash-keyed lookup on an idx prefix. Used when WHERE has an indexed equality.
Filter
Predicate evaluation. Pushed under projection where possible by the optimiser.
Project
Column selection — drops fields the user did not ask for before they reach the wire.
Limit
Truncates the stream. Pushed below sort when the sort key admits a top-K shortcut.
Sort
External-merge sort with spill-to-disk over the sealed-segment cache.
Aggregate
GROUP BY with COUNT / SUM / AVG / MIN / MAX. HAVING evaluates after the aggregate buckets finalise.
HashJoin
Build-side hash table on the smaller input, probe with the larger. INNER joins between two tables.
OuterJoin
LEFT, RIGHT, and FULL variants — emits NULL-filled rows for unmatched probe entries. Up to 5-table left-deep chains.
RelationHop
Walks rel|fwd or rel|rev edges. Powers neighbours, BFS, path, and Dijkstra.
-- SELECT c.name, SUM(o.amount) FROM orders o
-- JOIN customers c ON c.id = o.customer
-- WHERE o.status = 'paid' GROUP BY c.name HAVING SUM(o.amount) > 1000;

Aggregate { group: [c.name], having: SUM(o.amount) > 1000 }
└── Project { c.name, o.amount }
    └── HashJoin { o.customer = c.id }
        ├── IndexScan { idx|orders|status = "paid" }
        └── Scan { row|customers }
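
The tree is a plain recursive data type, which is what makes it cacheable and replayable. A sketch of the eleven operators as a serialisable Rust enum, assuming a serde-style derive; the field shapes are guesses, only the operator set comes from the list above.

use serde::{Deserialize, Serialize};

// Sketch only: a JSON-serialisable Plan tree, tagged by operator name.
#[derive(Serialize, Deserialize)]
#[serde(tag = "op")]
enum Plan {
    Scan        { prefix: String },
    ColumnScan  { prefix: String, columns: Vec<String> },
    IndexScan   { index: String, value: String },
    Filter      { predicate: String, input: Box<Plan> },
    Project     { columns: Vec<String>, input: Box<Plan> },
    Limit       { n: u64, input: Box<Plan> },
    Sort        { by: Vec<String>, input: Box<Plan> },
    Aggregate   { group: Vec<String>, having: Option<String>, input: Box<Plan> },
    HashJoin    { on: String, build: Box<Plan>, probe: Box<Plan> },
    OuterJoin   { kind: String, on: String, left: Box<Plan>, right: Box<Plan> }, // left | right | full
    RelationHop { edge: String, reverse: bool, input: Box<Plan> },
}

A cache hit on intent|<question_hash> deserialises this tree and executes it directly, skipping the LLM compile.
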
05 · replication model

Active-passive, sync.

One primary, one optional sync follower. WAL frames replicate before the primary returns 200. A follower joining a running cluster bootstraps via Frame::Snapshot — a chunked transfer of every key in the store — then tails the live frame stream from the snapshot's LSN.

Primary only · RPO ~0.5 s (WAL fsync) · RTO ~5–10 min (S3 restore)
Whisper tier default. Restore replays sealed segments + tail.

Sync follower · RPO 0 · RTO ~25 s (drilled)
Thunder, Storm, and Enterprise tiers. Verified end-to-end with snapshot bootstrap.

Active-passive sync replication is the production path. A commit is acknowledged only after the follower has the frame durably on disk. RPO=0, RTO ~25 s. See ops → failover for the promotion procedure.
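
The ack rule as a sketch; the trait and transport are illustrative, only the ordering comes from the text above.

use std::io::{Result, Write};

// Illustrative follower handle: ship() returns only once the frame is
// durable on the follower's disk.
trait Follower {
    fn ship(&mut self, frame: &[u8]) -> Result<()>;
}

fn commit<W: Write, F: Follower>(wal: &mut W, follower: &mut F, frame: &[u8]) -> Result<()> {
    wal.write_all(frame)?; // local WAL append (+ fsync, elided; see section 01)
    follower.ship(frame)?; // block until the follower has the frame on disk
    Ok(())                 // only now may the HTTP layer return 200
}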

06 · versioning

Single-row optimistic CAS.

Every row carries an internal _oc_row_version field. The API exposes put_row_cas, get_row_versioned, and delete_row_cas for optimistic concurrency. A CAS that loses the race fails the entire batch with a deterministic error — no partial application. Idempotency keys make retries safe: the same key plus the same body returns the original response; a different body with the same key returns 409.
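
A read–modify–write loop against that surface might look like this. The client trait and row type are sketches; only the call names come from the API above.

struct Order { amount: i64 }
enum CasError { VersionMismatch, Other(String) }

// Illustrative client surface for the CAS API.
trait OcClient {
    fn get_row_versioned(&self, schema: &str, pk: &str) -> Result<(Order, u64), CasError>;
    fn put_row_cas(&self, schema: &str, pk: &str, row: &Order, version: u64) -> Result<(), CasError>;
}

// A CAS that loses the race fails cleanly, so the caller re-reads and retries.
fn add_to_amount(client: &impl OcClient, pk: &str, delta: i64) -> Result<(), CasError> {
    loop {
        let (mut row, version) = client.get_row_versioned("orders", pk)?;
        row.amount += delta;
        match client.put_row_cas("orders", pk, &row, version) {
            Ok(()) => return Ok(()),
            Err(CasError::VersionMismatch) => continue, // lost the race; retry
            Err(e) => return Err(e),
        }
    }
}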

07 · backups & pitr

Sealed segments + continuous tail.

Two streams flow to S3 in parallel: sealed WAL segments shipped on roll, and a continuous tail-shipper that flushes the open segment every few hundred milliseconds. WAL frame v2 carries an embedded timestamp_micros so restore-to-timestamp resolves below the segment boundary.

# restore an instance to a wall-clock timestamp
oc-pitr restore \
  --tenant acme \
  --target "2026-04-29T18:42:00Z" \
  --into   acme-restore-001
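
The timestamp resolution behind restore-to-timestamp is simple to sketch: walk frames in LSN order and stop at the last one stamped at or before the target. The frame fields below are illustrative.

struct FrameV2 {
    lsn: u64,              // log sequence number, monotonically increasing
    timestamp_micros: u64, // wall-clock stamp embedded in WAL frame v2
}

fn cutoff_lsn(frames: &[FrameV2], target_micros: u64) -> Option<u64> {
    frames
        .iter()
        .take_while(|f| f.timestamp_micros <= target_micros)
        .map(|f| f.lsn)
        .last() // last frame at or before the target instant
}

Everything at or before the returned LSN replays into the restored instance; later frames in the segment are discarded.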

Sealed-segment PITR (segment-boundary granularity, ~5–10 min restore window) is included on every tier. Intra-segment LSN-precise PITR (sub-second granularity, ~0.5–1.5 s data-loss window) is a paid add-on — see pricing.