
Full-text search on OriginChain.

OriginChain runs managed full-text search on the same database that backs SQL and vector search. It supports Boolean AND, BM25-ranked retrieval (Lucene defaults k1=1.2, b=0.75), and phrase queries over a UAX #29 Unicode tokenizer. Postings live under h(tenant · "fts" · table · field · token) ‖ doc_id; per-doc length records (which BM25 needs for length normalisation) live under a sibling "fts_doclen" shape. Both share the same WAL, recovery, and observability as the rest of the engine.

# postings  — one entry per (token, doc_id)
h(tenant · "fts" · table · field · token) ‖ doc_id

# doclen   — one entry per doc_id (BM25 length normalisation)
h(tenant · "fts_doclen" · table · field · doc_id)

The field, its tokenizer, and its analyzer pipeline are declared in the manifest; see schemas → full-text fields. To insert documents, see insert → full-text. The rest of this page is the query reference.

Three query modes.

mode=boolean — posting-list intersection. Intersects the posting list of every term (AND) and returns doc_ids in lexicographic order.
curl "https://acme.ap-south-1.db.originchain.ai/v1/tenants/$T/fts/posts/body?mode=boolean&q=postgres+wal" \
  -H "Authorization: Bearer $OC_TOKEN"
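The boolean mode above reduces to intersecting sorted doc_id posting lists. A minimal Python sketch of that intersection, assuming each term's postings arrive as a sorted list (the function name is illustrative, not OriginChain's API):

```python
from typing import List

def intersect_postings(lists: List[List[str]]) -> List[str]:
    """AND-intersect posting lists of doc_ids; returns lexicographic order."""
    if not lists:
        return []
    result = set(lists[0])
    for postings in lists[1:]:
        result &= set(postings)  # a doc must appear in every term's list
    return sorted(result)

# e.g. postings for "postgres" and "wal"
intersect_postings([["doc1", "doc3", "doc7"], ["doc3", "doc5", "doc7"]])
```

A real engine would merge-walk the sorted lists instead of materialising sets, but the result is the same: only doc_ids present in every list survive.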
mode=bm25 — ranked retrieval with Lucene defaults (k1=1.2, b=0.75): IDF weighting, document-length normalisation, and a top-k cap.
curl "https://acme.ap-south-1.db.originchain.ai/v1/tenants/$T/fts/posts/body?mode=bm25&q=replication+lag&k=10" \
  -H "Authorization: Bearer $OC_TOKEN"

Score = Σ IDF(t) · (tf · (k1+1)) / (tf + k1 · (1 - b + b · dl/avgdl)). Corpus stats (N, avgdl) are cached and refreshed on re-index — no per-query prefix scan once cache is warm.
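The per-term contribution in that sum can be sketched directly. The docs don't state which IDF variant OriginChain uses, so this sketch assumes the Lucene-style smoothed form, log(1 + (N − df + 0.5)/(df + 0.5)); everything else follows the formula above term by term:

```python
import math

def bm25_term_score(tf: int, df: int, N: int, dl: int, avgdl: float,
                    k1: float = 1.2, b: float = 0.75) -> float:
    """One term's BM25 contribution. tf: term freq in doc, df: docs containing
    the term, N: corpus size, dl: doc length, avgdl: mean doc length.
    IDF variant is an assumption (Lucene's smoothed, never-negative form)."""
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
    norm = tf + k1 * (1 - b + b * dl / avgdl)
    return idf * (tf * (k1 + 1)) / norm
```

The cached corpus stats mentioned above (N, avgdl) are exactly the two inputs here that would otherwise need a prefix scan per query.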

mode=phrase — exact contiguous-token match via per-token position-list intersection.
curl "https://acme.ap-south-1.db.originchain.ai/v1/tenants/$T/fts/posts/body?mode=phrase&q=write+ahead+log" \
  -H "Authorization: Bearer $OC_TOKEN"

Each posting carries its position list. Phrase intersection keeps only position chains where pos[i+1] - pos[i] = 1 for every adjacent token pair.
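That adjacency walk can be sketched as follows, assuming one sorted position list per query token within a single document (names are illustrative):

```python
from typing import List

def phrase_match(position_lists: List[List[int]]) -> bool:
    """True if the tokens occur contiguously: some chain of positions exists
    with pos[i+1] - pos[i] == 1 for every adjacent pair."""
    if not position_lists:
        return False
    prev = set(position_lists[0])          # positions of the last matched token
    for positions in position_lists[1:]:
        prev = {p for p in positions if p - 1 in prev}  # extend chains by one
        if not prev:
            return False                   # no chain survives; bail early
    return True

# "write ahead log" at positions 0, 1, 2 → contiguous match
phrase_match([[0, 7], [1], [2]])
```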

Optional analyzer pipeline.

Tokens flow through whichever analyzer steps you list, in order. Index-time and query-time use the same pipeline — never analyse one and not the other.

  • lowercase — full Unicode case fold (handles Turkish dotted-i, German ß, etc.).
  • fold_diacritics — NFKD decomposition, then drop combining marks; "café" matches "cafe".
  • stop:<lang> — per-locale stop-word elimination. Same 18 languages as stemming.
  • stem:<lang> — Snowball stemmer: "running" / "ran" / "runs" → one token.
Stemming and stop-words are available for: Arabic, Danish, Dutch, English, Finnish, French, German, Hindi, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
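The first two steps map cleanly onto the Python standard library, which makes the pipeline idea concrete. A minimal sketch (stop/stem steps omitted since Snowball is not in the stdlib; function names are illustrative, not OriginChain's):

```python
import unicodedata

def lowercase(token: str) -> str:
    # full Unicode case fold, not just lower(): "Straße" → "strasse"
    return token.casefold()

def fold_diacritics(token: str) -> str:
    # NFKD-decompose, then drop combining marks: "café" → "cafe"
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def analyze(token: str, steps) -> str:
    # apply the declared steps in order; the same pipeline must run
    # at index time and at query time, or matching silently breaks
    for step in steps:
        token = step(token)
    return token

analyze("Café", [lowercase, fold_diacritics])  # → "cafe"
```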

Re-indexing contract.

Putting a row through /v1/rows/:t auto-indexes every declared [[fts]] field. The substrate keeps a per-doc token-set record so re-indexing a row that drops a token also retires the stale posting in the same WAL frame — no ghost matches. Corpus stats (N, avgdl) are cached and refreshed incrementally on each index_field call.
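The per-doc token-set record makes re-indexing a set-difference problem. A hedged sketch of the bookkeeping (the function is illustrative; the actual WAL-frame plumbing is internal to the engine):

```python
from typing import Set, Tuple

def reindex_diff(old_tokens: Set[str], new_tokens: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Postings to retire and to add when a row is re-indexed.
    Retired postings are what prevents ghost matches on dropped tokens."""
    retire = old_tokens - new_tokens   # stale postings, dropped in the same frame
    add = new_tokens - old_tokens      # fresh postings for new tokens
    return retire, add

# row edited: "wal" removed from the body, "replication" added
reindex_diff({"postgres", "wal"}, {"postgres", "replication"})
```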

Composition with SQL.

For facet counts, run a SQL GROUP BY on a normal column — every row is visible to SQL the moment it lands. To return source snippets alongside hits, fetch the row by doc_id from /v1/rows/:t after scoring. For pure-ASCII fast paths (SKU catalogs, log lines), set tokenizer = "ascii" in the manifest.