=========================================
Choosing an engine: semlix core vs bm25s
=========================================

semlix ships **two lexical engines** behind the same ``Index`` protocol. The
semantic / hybrid layer rides on either, so this choice is about the *lexical*
half of search.

Benchmark
=========

Measured with ``benchmark/bench_engines.py`` on 20,000 documents (needs
``pip install bm25s PyStemmer``):

.. list-table::
   :header-rows: 1

   * - engine
     - indexing
     - query p50
     - query mean
   * - semlix core (``semlix.index``)
     - ~1,400 docs/s
     - ~4.5 ms
     - ~4.5 ms
   * - **bm25s** (``semlix.bm25``)
     - **~56,000 docs/s**
     - **~0.08 ms**
     - **~0.08 ms**
   * - ratio
     - **~40x faster**
     - **~57x faster**
     - --

The bm25s engine scores BM25 over an in-memory matrix using native (NumPy)
code; the semlix core is a pure-Python, disk-based inverted index. bm25s is the
fast path, but it buys that speed by doing less.
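
To make "bag-of-words BM25" concrete, here is a minimal pure-Python sketch of
the scoring function. It is illustrative only: it is not the code path of either
engine, the whitespace tokenizer is a stand-in, and ``k1=1.5`` / ``b=0.75`` are
conventional defaults assumed here rather than values taken from semlix or
bm25s.

.. code-block:: python

   import math
   from collections import Counter


   def bm25_scores(query, docs, k1=1.5, b=0.75):
       """Score every document in ``docs`` against ``query`` with textbook BM25.

       Illustrative sketch; k1 and b are assumed conventional defaults.
       """
       tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer
       n_docs = len(tokenized)
       avg_len = sum(len(toks) for toks in tokenized) / n_docs

       # Document frequency: in how many documents each term appears.
       df = Counter(term for toks in tokenized for term in set(toks))

       scores = []
       for toks in tokenized:
           tf = Counter(toks)
           score = 0.0
           for term in query.lower().split():
               if term not in tf:
                   continue
               # Lucene-style IDF, which stays non-negative for common terms.
               idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
               # Term frequency saturates with k1 and is length-normalised by b.
               norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
               score += idf * tf[term] * (k1 + 1) / norm
           scores.append(score)
       return scores


   docs = ["the quick brown fox", "the lazy dog", "quick quick fox jumps"]
   scores = bm25_scores("quick fox", docs)

Because the score is a sum over query terms, ``"quick fox"`` and ``"fox quick"``
score identically: nothing in the formula depends on word order or position.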

Trade-offs
==========

**bm25s** (``semlix.bm25.create_bm25_index``)

* In-memory: the corpus must fit in RAM; the index is rebuilt on each
  ``commit()`` (great for batch/bulk loads, poor for frequent small updates).
* Bag-of-words BM25 only: **no** phrase/proximity, positions, faceting,
  highlighting, spelling, or the full query language.
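
A back-of-the-envelope estimate shows why rebuild-on-commit favors batch loads.
The sketch below assumes that a rebuild scales linearly with corpus size at the
~56,000 docs/s figure from the benchmark table; that linearity is an assumption,
not a measured property, and ``rebuild_seconds`` is a hypothetical helper.

.. code-block:: python

   BM25S_DOCS_PER_SEC = 56_000  # indexing throughput from the benchmark table


   def rebuild_seconds(corpus_size, docs_per_sec=BM25S_DOCS_PER_SEC):
       """Estimated time for one full rebuild (assumes linear scaling)."""
       return corpus_size / docs_per_sec


   corpus_size = 20_000

   # One bulk load followed by a single commit: one rebuild.
   bulk = rebuild_seconds(corpus_size)

   # 1,000 single-document updates, each followed by its own commit:
   # every commit rebuilds the whole (roughly constant-size) corpus.
   trickle = 1_000 * rebuild_seconds(corpus_size)

   print(f"one bulk commit:     {bulk:.2f} s")
   print(f"1,000 small commits: {trickle:.0f} s")

Under these assumptions a single bulk commit costs roughly a third of a second,
while a thousand one-document commits add up to about six minutes of rebuild
time.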

**semlix core** (``semlix.index.create_in``)

* Disk-based, incremental and segment-merging: scales past RAM and supports live
  add/update/delete without a full rebuild.
* Full query language, phrases/positions, faceting, collapsing, highlighting,
  spelling, spans, nested docs; pluggable analyzers, scoring and codecs.
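
The phrase/position gap comes down to what each kind of index stores. A
bag-of-words index keeps only term counts per document, whereas a positional
index also records where each term occurs, which is what phrase queries need.
The following is a minimal sketch of that idea in plain Python; it does not
reflect the on-disk format of the semlix core, and ``build_positional_index``
and ``phrase_match`` are hypothetical names.

.. code-block:: python

   from collections import defaultdict


   def build_positional_index(docs):
       """Map term -> {doc_id: [positions]} using a naive whitespace tokenizer."""
       index = defaultdict(dict)
       for doc_id, text in enumerate(docs):
           for pos, term in enumerate(text.lower().split()):
               index[term].setdefault(doc_id, []).append(pos)
       return index


   def phrase_match(index, phrase):
       """Return ids of documents containing ``phrase`` as adjacent terms."""
       terms = phrase.lower().split()
       if not terms or any(t not in index for t in terms):
           return []
       # Only documents containing every term can match.
       candidates = set(index[terms[0]])
       for term in terms[1:]:
           candidates &= set(index[term])
       hits = []
       for doc_id in sorted(candidates):
           for start in index[terms[0]][doc_id]:
               # Term i of the phrase must sit at position start + i.
               if all(start + i in index[t][doc_id]
                      for i, t in enumerate(terms)):
                   hits.append(doc_id)
                   break
       return hits


   docs = ["new york city guide", "york has a new mayor"]
   index = build_positional_index(docs)
   hits = phrase_match(index, "new york")

Both documents contain *new* and *york*, so a bag-of-words scorer treats both as
matches; only the positional lookup can tell that the phrase occurs in the first
document alone.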

When to use which
=================

Use **bm25s** when you want the highest lexical throughput and lowest query
latency, the corpus fits in memory, BM25 over words is enough (typical RAG / the
lexical leg of hybrid search), and updates arrive in batches.

Use the **semlix core** when you need the rich query language, phrases/positions,
facets, highlighting, spelling or custom analyzers/scoring, or when the index is
larger than RAM or needs incremental live updates.
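
The guidance above can be condensed into a small decision helper. This is only a
sketch: ``choose_engine`` and its parameters are hypothetical and not part of
semlix; the function simply encodes the rules of thumb from this page.

.. code-block:: python

   def choose_engine(*, fits_in_ram, needs_rich_features, needs_live_updates):
       """Return "bm25s" or "semlix core" following this page's rules of thumb.

       needs_rich_features: phrases/positions, facets, highlighting, spelling,
           the full query language, or custom analyzers/scoring.
       needs_live_updates: frequent incremental add/update/delete.
       """
       if needs_rich_features or needs_live_updates or not fits_in_ram:
           return "semlix core"
       return "bm25s"


   # A RAG corpus that fits in memory and is reloaded nightly in bulk.
   engine = choose_engine(
       fits_in_ram=True,
       needs_rich_features=False,
       needs_live_updates=False,
   )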

For **hybrid / semantic search**, :class:`~semlix.semantic.HybridSearcher`
accepts either as its lexical ``index``. The :mod:`semlix.unified` package wires
**bm25s + pgvector** together for the fast-lexical + semantic combination.

.. tip::

   For a fast lexical or hybrid service, default to **bm25s**; reach for the
   semlix core only when a feature it provides is actually required.