Choosing an engine: semlix core vs bm25s
semlix ships two lexical engines behind the same Index protocol. The
semantic / hybrid layer rides on either, so this choice is about the lexical
half of search.
Benchmark
Measured with benchmark/bench_engines.py on 20,000 documents (needs
pip install bm25s PyStemmer):
| engine | indexing | query p50 | query mean |
|---|---|---|---|
| semlix core | ~1,400 docs/s | ~4.5 ms | ~4.5 ms |
| bm25s | ~56,000 docs/s | ~0.08 ms | ~0.08 ms |
| ratio | ~40x faster | ~57x faster | – |
The bm25s engine scores BM25 over an in-memory matrix with native (NumPy) code; the semlix core is a pure-Python, disk-based inverted index. bm25s is the fast path, but it buys that speed by doing less.
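To reproduce numbers of this kind on your own corpus, a small timing harness is enough. The sketch below is not the project's benchmark/bench_engines.py; it is a minimal stand-in that times any callable taking a query string and reports the median (p50) and mean latency in milliseconds. The names `measure_latency` and `toy_search` are hypothetical, and the toy search function only stands in for a real engine.

```python
import statistics
import time
from typing import Callable, Iterable


def measure_latency(search: Callable[[str], object], queries: Iterable[str]) -> dict:
    """Time each call to `search` and summarise the latencies in milliseconds."""
    samples_ms = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "count": len(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
    }


# Stand-in for a real engine: a linear scan over a small in-memory corpus.
corpus = ["alpha beta", "beta gamma", "gamma delta"] * 1000


def toy_search(query: str) -> list:
    return [doc for doc in corpus if query in doc]


stats = measure_latency(toy_search, ["alpha", "beta", "gamma", "delta"] * 25)
print(stats)
```

Passing a closure around either engine's search call should let the same harness compare both; a p50 close to the mean, as in the table, suggests there are few slow outliers.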
Trade-offs
bm25s (semlix.bm25.create_bm25_index)
In-memory: the corpus must fit in RAM; the index is rebuilt on each commit() (great for batch/bulk loads, poor for frequent small updates).
Bag-of-words BM25 only: no phrase/proximity, positions, faceting, highlighting, spelling, or the full query language.
semlix core (semlix.index.create_in)
Disk-based, incremental, segment-merging — scales past RAM, supports live add/update/delete without a full rebuild.
Full query language, phrases/positions, faceting, collapsing, highlighting, spelling, spans, nested docs; pluggable analyzers, scoring and codecs.
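The "bag-of-words BM25 only" limitation can be seen in a few lines. The sketch below uses a textbook BM25 formula written from scratch (an assumption for illustration; it is not the scoring code of bm25s or of the semlix core) and shows that two documents containing the same words in a different order receive the same score, because only term counts are kept and positions are discarded. The function name `bm25_scores` is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term counts only; word positions are lost here
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


docs = [
    "new york pizza guide",     # contains the phrase "new york"
    "york guide new pizza",     # same words, but not as a phrase
    "pasta recipes from rome",  # no query words at all
]
scores = bm25_scores("new york", docs)
print(scores)
```

The first two documents tie, so a query that needs "new york" as an adjacent phrase cannot be expressed at this level; that is the kind of query the positional index in the semlix core exists for.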
When to use which
Use bm25s when you want maximum lexical throughput/latency, the corpus fits in memory, BM25 over words is enough (typical RAG / the lexical leg of hybrid search), and updates are batchy.
Use the semlix core when you need the rich query language, phrases/positions, facets, highlighting, spelling or custom analyzers/scoring, or when the index is larger than RAM or needs incremental live updates.
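How "batchy" updates need to be can be estimated with rough arithmetic. The sketch below is a simplified cost model, not a measurement: it assumes the indexing throughputs from the benchmark table hold at every index size, that a rebuild-on-commit engine re-indexes every document seen so far on each commit, and that an incremental engine indexes each document once (ignoring commit overhead and segment merges). The function names are hypothetical.

```python
DOCS = 20_000
BM25S_DOCS_PER_S = 56_000  # indexing rate from the benchmark table
CORE_DOCS_PER_S = 1_400    # indexing rate from the benchmark table


def rebuild_cost_seconds(total_docs: int, batch_size: int, docs_per_s: float) -> float:
    """Total indexing time if every commit re-indexes everything seen so far."""
    indexed = 0
    processed = 0
    while indexed < total_docs:
        indexed = min(indexed + batch_size, total_docs)
        processed += indexed  # a full rebuild touches every document so far
    return processed / docs_per_s


def incremental_cost_seconds(total_docs: int, docs_per_s: float) -> float:
    """Total indexing time if each document is indexed exactly once."""
    return total_docs / docs_per_s


bulk = rebuild_cost_seconds(DOCS, DOCS, BM25S_DOCS_PER_S)
per_thousand = rebuild_cost_seconds(DOCS, 1_000, BM25S_DOCS_PER_S)
per_doc = rebuild_cost_seconds(DOCS, 1, BM25S_DOCS_PER_S)
incremental = incremental_cost_seconds(DOCS, CORE_DOCS_PER_S)

print(f"bm25s, one bulk commit:          {bulk:.2f} s")
print(f"bm25s, commit every 1,000 docs:  {per_thousand:.2f} s")
print(f"bm25s, commit after every doc:   {per_doc:.0f} s")
print(f"semlix core, incremental:        {incremental:.1f} s")
```

Under these assumptions a single bulk load costs roughly 0.36 s and a commit every 1,000 documents roughly 3.75 s, while committing after every document costs about 3,570 s, against about 14 s for indexing each document once at the core's measured rate. The raw throughput advantage of bm25s therefore survives coarse batching but not per-document commits.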
For hybrid / semantic search, HybridSearcher
accepts either as its lexical index. The semlix.unified package wires
bm25s + pgvector together for the fast-lexical + semantic combination.
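The idea of a hybrid layer that accepts either lexical engine can be sketched with a structural interface. Everything named below is hypothetical: `LexicalIndex` is not semlix's actual Index protocol, `ToyLexicalIndex` stands in for a real engine, and reciprocal rank fusion (RRF) is used as a common fusion technique because this page does not say how HybridSearcher combines results.

```python
from typing import Protocol, Sequence


class LexicalIndex(Protocol):
    """Hypothetical minimal interface; semlix's real Index protocol may differ."""

    def search(self, query: str, k: int) -> Sequence[str]: ...


class ToyLexicalIndex:
    """Stand-in engine: ranks documents by the number of matching query words."""

    def __init__(self, docs: dict[str, str]) -> None:
        self.docs = docs

    def search(self, query: str, k: int) -> list[str]:
        words = set(query.lower().split())
        hits = {doc_id: len(words & set(text.lower().split()))
                for doc_id, text in self.docs.items()}
        matched = [doc_id for doc_id, count in hits.items() if count]
        return sorted(matched, key=lambda d: (-hits[d], d))[:k]


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: sum 1 / (k + rank) over every ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(lexical: LexicalIndex, semantic_ranking: Sequence[str],
                  query: str, k: int = 3) -> list[str]:
    """Fuse a lexical ranking from any conforming engine with a semantic ranking."""
    return rrf_fuse([lexical.search(query, k), semantic_ranking])[:k]


docs = {"a": "fast lexical search", "b": "semantic vector search", "c": "cooking pasta"}
semantic_ranking = ["b", "c", "a"]  # pretend output of a vector search
fused = hybrid_search(ToyLexicalIndex(docs), semantic_ranking, "lexical search")
print(fused)
```

Because `hybrid_search` depends only on the `search` method, swapping the lexical engine does not touch the fusion code, which is the property that makes the engine choice a purely lexical decision.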
Tip
For a fast lexical or hybrid service, default to bm25s; reach for the semlix core only when a feature it provides is actually required.