Choosing an engine: semlix core vs bm25s

semlix ships two lexical engines behind the same Index protocol. The semantic / hybrid layer rides on either, so this choice is about the lexical half of search.
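To make the "same protocol, interchangeable engines" idea concrete, here is a minimal sketch. The names `LexicalIndex`, `ToyInMemoryIndex` and `lexical_leg` are hypothetical and exist only for illustration; they are not semlix's actual Index protocol or classes, and the method set (`add`, `commit`, `search`) is an assumption.

```python
from typing import Protocol


class LexicalIndex(Protocol):
    """Hypothetical minimal interface; NOT semlix's real Index protocol."""

    def add(self, doc_id: str, text: str) -> None: ...
    def commit(self) -> None: ...
    def search(self, query: str, limit: int = 10) -> list[str]: ...


class ToyInMemoryIndex:
    """Stand-in engine that rebuilds its term sets on every commit()."""

    def __init__(self) -> None:
        self._pending: dict[str, str] = {}
        self._terms: dict[str, set[str]] = {}

    def add(self, doc_id: str, text: str) -> None:
        # Documents only become searchable after commit().
        self._pending[doc_id] = text

    def commit(self) -> None:
        # Full rebuild from everything added so far.
        self._terms = {d: set(t.lower().split()) for d, t in self._pending.items()}

    def search(self, query: str, limit: int = 10) -> list[str]:
        wanted = set(query.lower().split())
        # Rank by number of matching query terms, ties broken by doc id.
        ranked = sorted(self._terms, key=lambda d: (-len(wanted & self._terms[d]), d))
        return [d for d in ranked if wanted & self._terms[d]][:limit]


def lexical_leg(index: LexicalIndex, query: str) -> list[str]:
    """Caller code depends only on the protocol, so engines are swappable."""
    return index.search(query, limit=5)
```

Any class with the same three methods could be passed to `lexical_leg` without changing the caller, which is the property the hybrid layer relies on.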

Benchmark

Measured with benchmark/bench_engines.py on 20,000 documents (needs pip install bm25s PyStemmer):

engine                        indexing          query p50      query mean
semlix core (semlix.index)    ~1,400 docs/s     ~4.5 ms        ~4.5 ms
bm25s (semlix.bm25)           ~56,000 docs/s    ~0.08 ms       ~0.08 ms
ratio                         ~40x faster       ~57x faster
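For readers who want to reproduce figures like the p50 and mean columns on their own data, the following is a rough sketch of how per-query latency could be collected. It is not the contents of benchmark/bench_engines.py; `measure_latency` and the `search` callable are hypothetical names, and the sketch times each query once without warm-up.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency(search: Callable[[str], object], queries: Sequence[str]) -> dict:
    """Run each query once and summarise wall-clock latency in milliseconds."""
    samples_ms = []
    for query in queries:
        start = time.perf_counter()
        search(query)  # result is discarded; only the elapsed time matters
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "p50_ms": statistics.median(samples_ms),  # the "query p50" column
        "mean_ms": statistics.fmean(samples_ms),  # the "query mean" column
    }


# Usage with a dummy search function standing in for a real engine.
stats = measure_latency(lambda q: q.upper(), ["alpha", "beta", "gamma"])
print(stats)
```

A p50 close to the mean, as in the table, suggests the latency distribution has no heavy tail on that workload.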

The bm25s engine scores BM25 over an in-memory matrix with native (NumPy) code; the semlix core is a pure-Python, disk-based inverted index. bm25s is the fast path, but it buys that speed by doing less.
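As a reminder of what "scoring BM25" involves, here is a small pure-Python sketch of the Okapi BM25 formula. It is illustrative only and is neither bm25s's nor semlix's implementation: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and the non-negative (Lucene-style) IDF variant are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenised document in `docs` against the `query` tokens.

    `docs` must be a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # term frequencies; word order is discarded here
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["fast", "lexical", "search"],
    ["disk", "based", "index"],
    ["fast", "fast", "index"],
]
print(bm25_scores(["fast"], docs))
```

Because each document is reduced to term counts, `["fast", "search"]` and `["search", "fast"]` receive identical scores for the same query. That bag-of-words property is what makes a matrix formulation possible, and it is also why phrase and proximity queries are out of scope for this style of engine.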

Trade-offs

bm25s (semlix.bm25.create_bm25_index)

  • In-memory: the corpus must fit in RAM; the index is rebuilt on each commit() (great for batch/bulk loads, poor for frequent small updates).

  • Bag-of-words BM25 only: no phrase/proximity, positions, faceting, highlighting, spelling, or the full query language.

semlix core (semlix.index.create_in)

  • Disk-based, incremental, segment-merging: scales past RAM and supports live add/update/delete without a full rebuild.

  • Full query language, phrases/positions, faceting, collapsing, highlighting, spelling, spans, nested docs; pluggable analyzers, scoring and codecs.
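The practical difference between "bag-of-words only" and "phrases/positions" can be seen in a toy positional inverted index. This is a simplified sketch, not semlix's on-disk format or query code; `build_positional_index`, `phrase_match` and `bag_of_words_match` are hypothetical helpers written for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for a list of token lists."""
    index = defaultdict(dict)
    for doc_id, tokens in enumerate(docs):
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return ids of documents where the phrase terms occur consecutively."""
    if not phrase:
        return []
    first, rest = phrase[0], phrase[1:]
    hits = []
    for doc_id, starts in index.get(first, {}).items():
        for start in starts:
            # Every later term must sit exactly `offset` positions after `start`.
            if all(
                start + offset in index.get(term, {}).get(doc_id, ())
                for offset, term in enumerate(rest, start=1)
            ):
                hits.append(doc_id)
                break
    return hits


def bag_of_words_match(index, terms):
    """Return ids of documents containing every term, in any order."""
    postings = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


docs = [["new", "york", "city"], ["york", "new", "city"]]
index = build_positional_index(docs)
print(phrase_match(index, ["new", "york"]))
print(bag_of_words_match(index, ["new", "york"]))
```

The phrase query matches only the first document, while the bag-of-words query matches both. Storing positions is what enables phrase, proximity and span queries, at the cost of a larger and more complex index.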

When to use which

Use bm25s when you want maximum lexical throughput and minimum query latency, the corpus fits in memory, BM25 over words is enough (typical RAG / the lexical leg of hybrid search), and updates arrive in batches.

Use the semlix core when you need the rich query language, phrases/positions, facets, highlighting, spelling or custom analyzers/scoring, or when the index is larger than RAM or needs incremental live updates.
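The two rules above can be condensed into a small decision helper. This is only a sketch of the guidance on this page; `Requirements` and `choose_engine` are hypothetical names and are not part of semlix.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    fits_in_ram: bool          # can the whole corpus be held in memory?
    needs_rich_features: bool  # phrases, facets, highlighting, spelling, custom analyzers/scoring
    needs_live_updates: bool   # frequent small add/update/delete operations


def choose_engine(req: Requirements) -> str:
    """Return the engine suggested by the guidance above."""
    if req.needs_rich_features or req.needs_live_updates or not req.fits_in_ram:
        return "semlix core"
    # Default: fast path for batch-loaded, in-memory, bag-of-words BM25.
    return "bm25s"


print(choose_engine(Requirements(fits_in_ram=True, needs_rich_features=False, needs_live_updates=False)))
```

The helper defaults to bm25s and falls back to the semlix core as soon as any single requirement rules bm25s out.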

For hybrid / semantic search, HybridSearcher accepts either as its lexical index. The semlix.unified package wires bm25s + pgvector together for the fast-lexical + semantic combination.

Tip

For a fast lexical or hybrid service, default to bm25s; reach for the semlix core only when a feature it provides is actually required.