How OriginChain works under the hood.
OriginChain is one database that answers SQL, vector search, BM25 full-text, graph traversal, and natural language — through a single bearer token, against a single endpoint. Underneath, every query shape runs against the same hash-keyed substrate, so your rows, indexes, embeddings, and edges share one log, one recovery path, and one consistency story. No cross-engine sync. No drift. No second system to learn.
One AI-native database. One log. Atomic commits.
The storage layer is a single key-value store. Every key is a 16-byte BLAKE3 prefix over its logical identity; every value is a length-prefixed, CRC32C-framed blob. The index lives in memory for O(1) lookup, the data lives on disk as packed columnar chunks, and a write-ahead log is the single source of truth. Group commit merges concurrent writes into one fsync per window, so write throughput scales with vCPU count instead of bottlenecking on a single core.
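As a rough sketch of that framing (not OriginChain's actual code), a hashed key and a CRC-framed value might look like this in Rust, using the blake3 and crc32c crates; the field order and endianness are assumptions:

```rust
// Minimal sketch of the key/value framing described above, assuming the
// `blake3` and `crc32c` crates. Field order and endianness are illustrative
// assumptions, not OriginChain's on-disk format.
fn make_key(logical_identity: &[u8]) -> [u8; 16] {
    let digest = blake3::hash(logical_identity); // full 32-byte BLAKE3 digest
    let mut key = [0u8; 16];
    key.copy_from_slice(&digest.as_bytes()[..16]); // keep the 16-byte prefix
    key
}

fn frame_value(payload: &[u8]) -> Vec<u8> {
    let mut framed = Vec::with_capacity(payload.len() + 8);
    framed.extend_from_slice(&(payload.len() as u32).to_le_bytes()); // length prefix
    framed.extend_from_slice(payload);                                // body
    framed.extend_from_slice(&crc32c::crc32c(payload).to_le_bytes()); // CRC32C trailer
    framed
}
```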
Each managed instance runs on its own EC2 box, in its own VPC, on its own EBS volume, with its own TLS certificate. Tenancy is physical, not logical: there is no shared load balancer, no shared disk, and no shared memory between customers. A managed box in eu-west-1 sends bytes only to eu-west-1, so data residency is structural rather than configurable.
OriginChain's storage layer is a single key-value store with a 16-byte BLAKE3 prefix per key. Domain prefixes carve the keyspace so rows, indexes, relations, vectors, and full-text postings all share the same WAL.
Ten shapes. One vector database. One graph. One full-text index.
Domain-prefixed keys give OriginChain atomic writes across every derived structure your data needs. Schemas, rows, secondary indexes, forward and reverse edges, BM25 postings, vectors, HNSW graphs, and columnar chunks all live in the same hash-keyed store under a different domain tag. The 16-byte hashed prefix makes each layer enumerable by prefix scan; the suffix makes each entry unique. Add a vector column or a relation and your write path stays one batch — no sidecar engine, no double-bookkeeping, no eventual-consistency window between your row and its index.
The canonical row layer. The 16-byte hashed prefix is shared across every row in a (tenant, table), so a prefix scan walks the table in primary-key order. The pk is appended raw so full-table enumeration and range scans both work without a second indirection.
Secondary indexes. One key per (declared index, values, pk) tuple. Lookups are prefix scans over the values portion; the pk suffix lets the executor jump back to the row layer in a single key fetch.
Outgoing edges. The explicit direction tag keeps self-relations safe — "manager_of" forward is keyed differently from "manager_of" reverse on the same table — so graph traversals never collide with their mirrors.
Incoming edges. Forward and reverse traversals are both prefix scans over a 16-byte hashed prefix, so "orders for client C" and "client for order O" cost the same — no edge table double-write, no graph sidecar.
Full-text postings. One key per (token, doc) posting. BM25 ranking, boolean AND, and phrase queries all walk this prefix; phrase intersection consults the per-posting position list stored in the value.
Per-document field length, used by BM25 normalisation. Kept in its own domain so the postings prefix stays clean for the hot path.
Cached avgdl + N for BM25. Updated atomically with each indexed write so ranking weights never drift mid-query.
f32 vectors, length-prefixed and CRC-framed like every other value. Cosine, dot, and L2 are SIMD-kernel primitives over the deserialised graph cache.
The HNSW graph blob. One value per (tenant, table), opened on instance start, deserialised into the in-memory cache, and rebuilt on schema change. Tunable speed/recall trade-off: the default high_recall mode hits recall@10 = 0.96 on 100k vectors with p99 109 ms; fast mode runs p99 37 ms at recall 0.69.
Packed columnar segments: Dict, Bitpack, and Delta encodings, picked per column by whichever produces the smallest encoded body. The stats sidecar stores per-segment min / max / HLL so the cost model can prune segments before opening them.
h(...) is a 16-byte BLAKE3 prefix. ‖ is byte concatenation. Every shape is a prefix-scan target — that is what lets a hash-keyed k/v store serve rows, indexes, graphs, FTS postings, and vectors out of one log.
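To make the notation concrete, here is one way such domain-prefixed keys could be composed. The tag values, the separator byte, and the exact contents of the hashed identity are illustrative assumptions, not the real layout:

```rust
// Illustrative sketch of domain-prefixed keys of the shape
// h(domain ‖ tenant ‖ table) ‖ suffix. Tag values, the separator byte, and
// the exact composition of the hashed identity are assumptions.
#[derive(Clone, Copy)]
#[repr(u8)]
enum Domain {
    Row = 1,
    Index = 2,
    EdgeFwd = 3,
    EdgeRev = 4,
    Posting = 5,
    Vector = 6,
}

fn domain_key(domain: Domain, tenant: &str, table: &str, suffix: &[u8]) -> Vec<u8> {
    // Hash the logical identity (domain tag + tenant + table) down to 16 bytes,
    // so every entry in one layer of one (tenant, table) shares a scan prefix.
    let mut identity = vec![domain as u8];
    identity.extend_from_slice(tenant.as_bytes());
    identity.push(0); // separator byte (assumption)
    identity.extend_from_slice(table.as_bytes());
    let digest = blake3::hash(&identity);

    // Prefix ‖ raw suffix: the suffix (pk, index values, token, edge target, …)
    // keeps each entry unique and range-scannable.
    let mut key = Vec::with_capacity(16 + suffix.len());
    key.extend_from_slice(&digest.as_bytes()[..16]);
    key.extend_from_slice(suffix);
    key
}
```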
One plan tree. Vector, SQL, BM25, graph, English.
Every query — SQL, vector topk, BM25 full-text, graph traversal, or natural language — compiles to the same JSON-serialisable Plan AST. The same volcano-style executor walks rows, indexes, postings, and chunks; predicate rewriting folds constants and pushes filters under joins before the cost model ranks shapes. Numeric filters run through SIMD kernels (AVX2 on Intel, NEON on ARM, portable std::simd elsewhere), and chained joins build a left-deep tree up to five tables.
A natural-language question lands in the same Plan tree as a hand-written SQL statement, so an LLM-generated query gets the same cost model, the same EXPLAIN output, and the same per-node statistics as anything you write yourself. There is no LLM in the hot path: the model emits a plan, the executor runs it.
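For a feel of what that plan tree can look like on the wire, here is a deliberately simplified sketch; the node names and fields are invented for illustration and are not OriginChain's actual AST:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical, heavily simplified plan tree; node names and fields are
// illustrative, not OriginChain's actual AST.
#[derive(Serialize, Deserialize)]
#[serde(tag = "node")]
enum Plan {
    Scan { table: String, index: Option<String> },
    Filter { input: Box<Plan>, predicate: String },
    Join { left: Box<Plan>, right: Box<Plan>, on: (String, String) },
    Sort { input: Box<Plan>, by: String, limit: Option<u64> },
    VectorTopK { table: String, column: String, k: usize },
}

fn main() -> serde_json::Result<()> {
    // Whether it came from SQL, a vector query, or an LLM translating English,
    // the executor only ever sees a tree of these nodes.
    let plan = Plan::Sort {
        input: Box::new(Plan::Filter {
            input: Box::new(Plan::Scan { table: "orders".into(), index: None }),
            predicate: "total > 100".into(),
        }),
        by: "created_at".into(),
        limit: Some(10),
    };
    println!("{}", serde_json::to_string_pretty(&plan)?);
    Ok(())
}
```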
Walks the row or index domain by prefix. The cost model picks between full scan and index scan from per-segment histograms, never from a guess.
Streams packed segments and per-row chunks in deterministic order, decoding only the requested columns. Stats-sidecar pruning skips provably-empty segments before any I/O.
Predicate sub-tree with short-circuiting And/Or and first-class set membership. Numeric predicates run through SIMD kernels — AVX2 on Intel, NEON on ARM, portable std::simd everywhere else.
Two-way hash join is the building block; chained joins compose it into a left-deep tree of up to five tables. OUTER variants null-fill missing sides exactly once at materialisation, so a missing right side never duplicates the left.
Hash aggregation over the keyset; HAVING is a Filter applied to the aggregate output, not the input rows. Empty groups never appear in the result.
Postgres-style null ordering (nulls high, ASC gets NULLS LAST). Cross-type compares collapse to Equal rather than lying about order. Limit is pushed through Sort whenever correctness allows.
Every query — SQL, vector topk, BM25, graph traversal, or natural language — compiles to the same Plan tree. The same executor walks rows, indexes, and chunks.
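As an illustration of the SIMD predicate kernels mentioned above, here is a minimal greater-than filter over a packed numeric column using nightly Rust's portable std::simd; the function shape is an assumption, and the std::simd API is still unstable:

```rust
#![feature(portable_simd)] // nightly only; the portable std::simd API is unstable
use std::simd::prelude::*;

// Sketch of a SIMD predicate kernel: evaluate `value > threshold` over a
// packed f64 column four lanes at a time and collect the indices of passing
// rows. The function shape is an illustrative assumption.
fn filter_gt(column: &[f64], threshold: f64) -> Vec<u32> {
    let t = f64x4::splat(threshold);
    let mut out = Vec::new();
    let chunks = column.chunks_exact(4);
    let tail = chunks.remainder();

    for (i, chunk) in chunks.enumerate() {
        let v = f64x4::from_slice(chunk);
        let bits = v.simd_gt(t).to_bitmask(); // one compare for four lanes
        for lane in 0..4 {
            if bits & (1 << lane) != 0 {
                out.push((i * 4 + lane) as u32);
            }
        }
    }

    // Scalar fallback for the last few values that do not fill a full lane group.
    let base = column.len() - tail.len();
    for (j, &v) in tail.iter().enumerate() {
        if v > threshold {
            out.push((base + j) as u32);
        }
    }
    out
}
```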
One row in. Every derived structure, atomically.
A single put_row writes the row, every secondary index, every forward and reverse edge, the per-row OCC chunk for declared extractions, the BM25 postings, and the vector embedding — all in one Store::write_batch. The whole batch lands as one WAL frame, hits one fsync, and broadcasts to the follower as one unit.
Your derived structures never drift from your rows. A torn frame drops the entire batch on recovery, so there is no half-written state where the row exists but its index does not. Edges are diffed before write, so reassigning a foreign key never leaks a stale reverse edge.
One row in. One atomic batch covering rows, indexes, relations, columnar chunks, full-text postings, and the vector. One WAL frame. One replication broadcast. Your derived structures cannot drift from your rows.
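Conceptually, the batch behind a single put_row can be pictured like this. It reuses the illustrative Domain, domain_key, and frame_value helpers from the sketches above, and the index values, tokens, and embedding are stand-ins for the real extraction pipeline:

```rust
// Conceptual sketch only: assemble every derived key/value for one row into a
// single batch, which the engine would then commit as one WAL frame and one
// fsync. Reuses the illustrative domain_key / frame_value helpers from above.
type KvPair = (Vec<u8>, Vec<u8>);

fn build_row_batch(
    tenant: &str,
    table: &str,
    pk: &[u8],
    row_bytes: &[u8],       // encoded row
    index_values: &[&[u8]], // one encoded value per declared secondary index
    tokens: &[&str],        // analysed full-text tokens
    embedding: &[f32],      // vector for the embedded column
) -> Vec<KvPair> {
    let mut batch: Vec<KvPair> = Vec::new();

    // 1. Canonical row.
    batch.push((domain_key(Domain::Row, tenant, table, pk), frame_value(row_bytes)));

    // 2. Secondary indexes: key is (values ‖ pk), value is empty.
    for v in index_values {
        let mut suffix = v.to_vec();
        suffix.extend_from_slice(pk);
        batch.push((domain_key(Domain::Index, tenant, table, &suffix), frame_value(&[])));
    }

    // 3. Full-text postings: one key per (token, doc).
    for token in tokens {
        let mut suffix = token.as_bytes().to_vec();
        suffix.push(0);
        suffix.extend_from_slice(pk);
        batch.push((domain_key(Domain::Posting, tenant, table, &suffix), frame_value(&[])));
    }

    // 4. Vector value: f32s packed little-endian, CRC-framed like everything else.
    let vec_bytes: Vec<u8> = embedding.iter().flat_map(|f| f.to_le_bytes()).collect();
    batch.push((domain_key(Domain::Vector, tenant, table, pk), frame_value(&vec_bytes)));

    // A real put_row would also diff edges and update BM25 stats, then hand the
    // whole batch to one write_batch call: one WAL frame, one fsync.
    batch
}
```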
Active-passive replication. RPO = 0. ~25 s RTO.
Every committed WAL frame ships to a passive follower in real time. async, sync_one, and sync_quorum are per-write opt-ins, so you tune latency and durability per call. On paid tiers, sync replication delivers RPO = 0 — no acknowledged write is ever lost on writer failure.
An S3-backed lease is the strongly consistent arbiter of which node is primary. When the writer dies, the lease expires, the follower takes the lease and is promoted within ~25 seconds. New replicas bootstrap via a Frame::Snapshot transfer rather than replaying the entire log, so adding capacity does not stall your writer.
Active-passive sync replication with RPO = 0 on paid tiers. New followers bootstrap via Frame::Snapshot transfer. Writer failure to follower promotion: ~25 seconds.
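The three durability levels read naturally as a per-call parameter; the enum below is just one way to picture them, not the actual client API:

```rust
// The three per-write opt-ins described above, sketched as a parameter a write
// call might accept. Names match the modes above; the representation is an
// illustrative assumption.
enum Durability {
    Async,      // acknowledge after the local WAL write
    SyncOne,    // acknowledge after the follower confirms the frame (RPO = 0)
    SyncQuorum, // acknowledge after a quorum of replicas confirm the frame
}
```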
Point-in-time recovery. Restore to a moment.
WAL frame v2 embeds a timestamp_micros field at write time, so every byte that lands in the log knows when it happened. Sealed segments ship to S3 continuously, and a tail-shipper streams the open segment between seals — your archive is never more than a few seconds behind production.
Restore-to-timestamp at sealed-segment granularity is a one-click action in the dashboard or a single HTTP call. The platform reconstructs the database into a fresh instance at the requested moment, replaying frames up to the chosen segment boundary. Roll back a bad migration, recover a corrupted tenant, or branch a forensic copy — without touching the live writer.
| Property | Detail |
| --- | --- |
| WAL framing | length · payload · CRC32C |
| Frame version | v2 · embedded timestamp_micros |
| Group commit | 100 µs or 256-writer window · single fsync |
| Bootstrap | Frame::Snapshot transfer to new follower |
| RPO | 0 on paid tiers (sync_one / sync_quorum) |
| RTO | ~25 s from S3-lease takeover to follower promotion |
| Fuzzing | 1 M deterministic crash iterations per CI run |
| PITR | restore-to-timestamp at sealed-segment granularity |
| Archive | sealed-segment ship + continuous tail to S3 |
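Putting the framing rows of that table together, a v2 frame can be pictured roughly as follows; field order and integer widths are assumptions:

```rust
// Rough picture of a v2 WAL frame from the table above: length-prefixed
// payload, embedded microsecond timestamp, CRC32C trailer. Field order and
// integer widths are assumptions.
struct WalFrameV2 {
    len: u32,              // payload length
    timestamp_micros: u64, // write time; what point-in-time recovery replays against
    payload: Vec<u8>,      // one group-committed batch of key/value writes
    crc32c: u32,           // integrity check; a torn frame drops the whole batch on recovery
}

fn encode(frame: &WalFrameV2) -> Vec<u8> {
    let mut out = Vec::with_capacity(16 + frame.payload.len());
    out.extend_from_slice(&frame.len.to_le_bytes());
    out.extend_from_slice(&frame.timestamp_micros.to_le_bytes());
    out.extend_from_slice(&frame.payload);
    out.extend_from_slice(&frame.crc32c.to_le_bytes());
    out
}
```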
Every query. Every span. Your stack.
Every query carries a query-id returned in X-OC-Query-Id. Append ?explain=true to any endpoint and the response includes the plan tree annotated with estimated rows, actual rows, time, chunks read, and segments pruned by the stats sidecar — no separate EXPLAIN endpoint, no separate auth path. Cancel an in-flight query by id and the response closes cleanly. Spans push to your OTLP collector with tail-based sampling, so every slow query gets traced automatically while fast queries stay cheap.
Plan tree with per-node RunStats: estimated rows, actual rows, time, chunks read, segments pruned. One query parameter, no separate endpoint.
Spans push to your OTLP collector — Honeycomb, Datadog, Grafana Cloud, Tempo, Jaeger. Tail-based sampling: every slow and every NL query traced; fast queries 1 % probabilistic.
Prometheus-format scrape: per-key bucket fill, group-commit window p50 / p99, WAL fsync latency, replication lag, plan cache hit rate.
Subscribe to any plan over Server-Sent Events. The WAL is the stream — no polling loop, no CDC sidecar, no second engine to keep coherent.
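A hypothetical client call ties the observability hooks together. The endpoint path, request body, and OC_TOKEN variable are placeholders rather than documented API; the example assumes reqwest with the blocking feature:

```rust
use reqwest::blocking::Client;

// Hypothetical request against a managed instance, illustrating the hooks
// described above: `?explain=true` on the query endpoint and the
// `X-OC-Query-Id` response header. The path `/v1/query`, the body shape, and
// the OC_TOKEN variable are placeholders, not documented API.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let resp = client
        .post("https://your-instance.example.com/v1/query?explain=true")
        .bearer_auth(std::env::var("OC_TOKEN")?)
        .body(r#"{"sql": "SELECT id FROM orders WHERE total > 100 LIMIT 10"}"#)
        .send()?;

    // Every response carries a query id; keep it if you need to cancel or trace.
    if let Some(id) = resp.headers().get("X-OC-Query-Id") {
        eprintln!("query id: {}", id.to_str()?);
    }

    // With explain=true the body includes the annotated plan tree alongside rows.
    println!("{}", resp.text()?);
    Ok(())
}
```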
One database for your whole AI stack.
SQL, vector search, BM25, graph, and natural language against the same managed database — one bearer, one endpoint, one instance per tenant. A managed instance comes online in about ninety seconds, and the quickstart walks you from signup to your first English query in under ten minutes.