OriginChain runs managed full-text search on the same database that backs SQL and vector search.
It supports boolean AND, BM25 ranking (Lucene defaults k1=1.2, b=0.75), and phrase queries over a UAX #29 Unicode tokenizer.
Postings live under h(tenant · "fts" · table · field · token) ‖ doc_id. Per-doc length records (BM25 needs them) live under a sibling "fts_doclen" shape. Both share the WAL, recovery, and observability the rest of the engine has.
# postings — one entry per (token, doc_id)
h(tenant · "fts" · table · field · token) ‖ doc_id
# doclen — one entry per doc_id (BM25 length normalisation)
h(tenant · "fts_doclen" · table · field · doc_id)
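The two key shapes above can be sketched in a few lines. This is an illustration only: the engine's actual hash function and separator are not specified here, so `h` below (a BLAKE2b digest over NUL-joined path components) is an assumption.

```python
import hashlib

def h(*parts: str) -> bytes:
    # Stand-in for the engine's path hash — the real function is an
    # implementation detail; BLAKE2b over NUL-joined parts is assumed here.
    return hashlib.blake2b("\x00".join(parts).encode(), digest_size=16).digest()

def posting_key(tenant: str, table: str, field: str, token: str, doc_id: str) -> bytes:
    # h(tenant · "fts" · table · field · token) ‖ doc_id
    return h(tenant, "fts", table, field, token) + doc_id.encode()

def doclen_key(tenant: str, table: str, field: str, doc_id: str) -> bytes:
    # h(tenant · "fts_doclen" · table · field · doc_id)
    return h(tenant, "fts_doclen", table, field, doc_id)
```

Because the token is inside the hashed prefix, all postings for one (token, field) pair are contiguous, and the trailing doc_id gives the lexicographic ordering boolean mode returns.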
Fields, tokenizers, and the analyzer pipeline are declared in the manifest. See schemas → full-text fields. To insert documents, see insert → full-text. The rest of this page is the query reference.
Three query modes.
mode=boolean — posting-list intersection: an AND of every term. Returns doc_ids in lexicographic order.

mode=bm25 — BM25-ranked retrieval; returns the top-k hits by score.
hits = oc.fts("posts", "body").search(
    q="replication lag",
    mode="bm25",
    k=10,
)
for h in hits:
    print(h.score, h.doc_id)
Score = Σ IDF(t) · (tf · (k1+1)) / (tf + k1 · (1 - b + b · dl/avgdl)).
Corpus stats (N, avgdl) are cached and refreshed on re-index — no per-query prefix scan once the cache is warm.
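The per-term score above is easy to check by hand. A minimal sketch, with one assumption: the page does not state which IDF variant is used, so the Lucene-style lower-bounded form is assumed below.

```python
import math

def bm25_term_score(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    # IDF variant is an assumption (Lucene-style); the doc only fixes k1 and b.
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    # tf saturation plus length normalisation, exactly the formula above.
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The formula's shape is what matters: the score is zero at tf=0, grows sub-linearly with tf, and shrinks as the document gets longer than avgdl.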
mode=phrase — exact contiguous-token match via per-token position-list intersection. Each posting carries its position list; the intersection requires posL[i+1] - posL[i] = 1 for every adjacent token pair.
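For one candidate document, the adjacency check reduces to a few lines. A sketch (the real engine walks sorted lists with cursors; linear membership tests here keep the illustration short):

```python
def phrase_match(position_lists):
    # position_lists[i] = sorted positions of the i-th phrase token in one doc.
    # A match needs a start p with p+1, p+2, ... present in the later lists,
    # i.e. posL[i+1] - posL[i] = 1 for every adjacent pair.
    first, rest = position_lists[0], position_lists[1:]
    return any(
        all(p + 1 + i in plist for i, plist in enumerate(rest))
        for p in first
    )
```

Only documents that survive the plain posting-list intersection ever reach this check, so the position walk runs on a small candidate set.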
Optional analyzer pipeline.
Tokens flow through whichever analyzer steps you list, in order. Index-time and query-time
use the same pipeline — never analyse one and not the other.
lowercase — Full Unicode case fold (handles Turkish dotted-i, German ß, etc.).
fold_diacritics — NFKD + drop combining marks. "café" matches "cafe".
stop:<lang> — Per-locale stop-word elimination. Same 18 languages as stemming.
stem:<lang> — Snowball stemmer. "running" / "ran" / "runs" → one token.
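The first two steps can be sketched with the standard library alone (stop and stem need per-language tables, so they are omitted). `str.casefold()` stands in for the full Unicode case fold, and the pipeline runner shows why index-time and query-time must share the same step list:

```python
import unicodedata

def lowercase(tokens):
    # str.casefold() approximates the full case fold: "Straße" -> "strasse"
    return [t.casefold() for t in tokens]

def fold_diacritics(tokens):
    # NFKD-decompose, then drop combining marks: "café" -> "cafe"
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]

def analyze(tokens, steps):
    # Steps run in the order listed — the same list at index and query time.
    for step in steps:
        tokens = step(tokens)
    return tokens
```

If the query side skipped fold_diacritics, a query for "cafe" would hash to a different posting key than the indexed "café" token and silently miss.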
Putting a row through /v1/rows/:t
auto-indexes every declared [[fts]]
field. The substrate keeps a per-doc token-set record so re-indexing a row that drops
a token also retires the stale posting in the same WAL frame — no ghost matches.
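The token-set record makes re-indexing a pure set difference. A sketch of the bookkeeping (the function name and return shape are illustrative, not the engine's internal API):

```python
def reindex_ops(old_tokens: set, new_tokens: set) -> dict:
    # Postings to retire (token dropped from the doc) and to add, emitted
    # together so both mutations land in the same WAL frame — no ghost matches.
    return {
        "delete": sorted(old_tokens - new_tokens),
        "add": sorted(new_tokens - old_tokens),
    }
```

Tokens present in both sets need no touch at all, which keeps re-index cost proportional to the edit, not the document.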
Corpus stats (N, avgdl) are cached and refreshed
incrementally on each index_field call.
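Incremental maintenance of (N, avgdl) needs only two counters. A sketch of the idea under stated assumptions — the class and method names are hypothetical, not the substrate's API:

```python
class CorpusStats:
    # N and avgdl maintained incrementally on each index call,
    # so queries never scan the doclen prefix.
    def __init__(self):
        self.n = 0          # document count N
        self.total_len = 0  # sum of doc lengths; avgdl = total_len / n

    def index_doc(self, new_len, old_len=None):
        if old_len is None:           # first index of this doc
            self.n += 1
        else:                         # re-index: retire the old length
            self.total_len -= old_len
        self.total_len += new_len

    @property
    def avgdl(self):
        return self.total_len / self.n if self.n else 0.0
```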
Composition with SQL.
For facet counts, run a SQL GROUP BY on a normal column —
every row is visible to SQL the moment it lands. To return source snippets alongside hits, fetch the row by doc_id from /v1/rows/:t after scoring.
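The score-then-hydrate pattern composes as a small helper. A sketch only: `search` and `fetch_row` below are placeholder callables standing in for the fts call shown earlier and a GET on /v1/rows/:t — the actual client methods may differ.

```python
def hits_with_snippets(search, fetch_row, q, k=10):
    # search(q, k) -> iterable of (score, doc_id) pairs (e.g. from mode=bm25);
    # fetch_row(doc_id) -> the source row, fetched after scoring.
    return [(score, fetch_row(doc_id)) for score, doc_id in search(q, k)]
```

Scoring touches only postings and doclen records; the row store is read once per returned hit, never per candidate.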
For pure-ASCII fast paths (SKU catalogs, log lines), set tokenizer = "ascii" in the manifest.