=========================================
Choosing an engine: semlix core vs bm25s
=========================================

semlix ships **two lexical engines** behind the same ``Index`` protocol. The
semantic / hybrid layer rides on either, so this choice is about the *lexical*
half of search.

Benchmark
=========

Measured with ``benchmark/bench_engines.py`` on 20,000 documents (needs
``pip install bm25s PyStemmer``):

.. list-table::
   :header-rows: 1

   * - engine
     - indexing
     - query p50
     - query mean
   * - semlix core (``semlix.index``)
     - ~1,400 docs/s
     - ~4.5 ms
     - ~4.5 ms
   * - **bm25s** (``semlix.bm25``)
     - **~56,000 docs/s**
     - **~0.08 ms**
     - **~0.08 ms**
   * - ratio
     - **~40x faster**
     - **~57x faster**
     - --

The bm25s engine scores BM25 over an in-memory matrix with native (numpy) code;
the semlix core is a pure-Python, disk-based inverted index. bm25s is the fast
path, but it buys that speed by doing less.

Trade-offs
==========

**bm25s** (``semlix.bm25.create_bm25_index``)

* In-memory: the corpus must fit in RAM; the index is rebuilt on each
  ``commit()`` (great for batch/bulk loads, poor for frequent small updates).
* Bag-of-words BM25 only: **no** phrase/proximity, positions, faceting,
  highlighting, spelling, or the full query language.

**semlix core** (``semlix.index.create_in``)

* Disk-based, incremental, segment-merging: scales past RAM, supports live
  add/update/delete without a full rebuild.
* Full query language, phrases/positions, faceting, collapsing, highlighting,
  spelling, spans, nested docs; pluggable analyzers, scoring and codecs.

When to use which
=================

Use **bm25s** when you want maximum lexical throughput/latency, the corpus fits
in memory, BM25 over words is enough (typical RAG / the lexical leg of hybrid
search), and updates are batchy.

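
What "BM25 over words" means in practice can be illustrated with a minimal
pure-Python sketch. This is not the bm25s implementation and not a semlix API:
the names ``tokenize`` and ``bm25_scores``, the whitespace tokenizer, the
``k1``/``b`` values and the particular IDF variant are all assumptions chosen
for illustration. The point it tries to show is that each document is reduced
to term counts, so word order is gone and a phrase query has nothing to match
against.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive analyzer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with bag-of-words BM25."""
    # Each document becomes a bag of term counts; positions are discarded here.
    bags = [Counter(tokenize(d)) for d in docs]
    lengths = [sum(bag.values()) for bag in bags]
    n_docs = len(docs)
    avg_len = sum(lengths) / n_docs

    scores = [0.0] * n_docs
    for term in set(tokenize(query)):
        df = sum(1 for bag in bags if term in bag)  # document frequency
        if df == 0:
            continue
        # One common IDF variant; other BM25 flavours use different formulas.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for i, bag in enumerate(bags):
            tf = bag[term]
            if tf == 0:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * lengths[i] / avg_len)
            scores[i] += idf * tf * (k1 + 1) / norm
    return scores


docs = [
    "fast lexical search",
    "search lexical fast",    # same words, different order
    "slow semantic ranking",  # shares no term with the query
]
scores = bm25_scores("lexical search", docs)
```

The first two documents receive the same score because they contain the same
words in a different order, and the third scores zero. An engine that has to
tell those two apart (phrase or proximity queries) needs stored positions.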

Use the **semlix core** when you need the rich query language,
phrases/positions, facets, highlighting, spelling or custom analyzers/scoring,
or when the index is larger than RAM or needs incremental live updates.

For **hybrid / semantic search**, :class:`~semlix.semantic.HybridSearcher`
accepts either as its lexical ``index``. The :mod:`semlix.unified` package wires
**bm25s + pgvector** together for the fast-lexical + semantic combination.

.. tip::

   For a fast lexical or hybrid service, default to **bm25s**; reach for the
   semlix core only when a feature it provides is actually required.
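
The selection guidance on this page can be condensed into a small decision
helper. This is only a sketch and not part of semlix: the ``Requirements``
dataclass, its field names and ``choose_engine`` are hypothetical. The two
returned strings are the factory names mentioned on this page; the helper does
not import or call them.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical checklist of what a deployment needs from the lexical engine."""
    fits_in_ram: bool = True
    # Frequent small add/update/delete, as opposed to bulk reloads.
    needs_live_updates: bool = False
    # Phrases/positions, facets, highlighting, spelling, the full query
    # language, or custom analyzers/scoring.
    needs_rich_features: bool = False


def choose_engine(req: Requirements) -> str:
    """Return the dotted name of the factory this page recommends."""
    if not req.fits_in_ram or req.needs_live_updates or req.needs_rich_features:
        return "semlix.index.create_in"
    # Default to the fast path when nothing requires the core.
    return "semlix.bm25.create_bm25_index"


rag_service = choose_engine(Requirements())
faceted_catalog = choose_engine(Requirements(needs_rich_features=True))
```

With these inputs, ``rag_service`` resolves to the bm25s factory and
``faceted_catalog`` to the semlix core. Treating bm25s as the default branch
mirrors the tip: the core is selected only when a listed requirement forces it.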