semlix 3.2 release notes

semlix 3.2.0

A modernization, performance and tooling release. Backward compatible for Python 3 users; Python 2 support was removed (it had been unmaintained for years). The canonical change list is in CHANGELOG.md.

Highlights

  • Engine guide (Choosing an engine: semlix core vs bm25s): a measured comparison of the two lexical engines. On 20k documents the bm25s engine indexes ~40x faster and queries ~57x faster than the pure-Python core. Use bm25s for raw lexical speed and as the lexical leg of hybrid search; use the core for its full feature set.

  • Optional native acceleration: SEMLIX_COMPILE=1 compiles a curated set of hot modules with mypyc; pure-Python stays the default and the automatic fallback.

  • Faster hybrid search: lexical and semantic searches now run concurrently, and the unused side is skipped when alpha is 0 or 1; a bounded query-embedding cache avoids re-embedding repeated queries.

  • Incremental BM25s indexing: only new documents are tokenized on add (instead of re-tokenizing the whole corpus each commit).

  • StandardAnalyzer indexing fast path: a fused tokenize+lowercase+stop+count loop, asserted bit-for-bit identical to the generic pipeline.
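The hybrid-search changes above can be illustrated with a minimal sketch. It is not semlix's implementation: lexical_search, semantic_search, embed, cached_embed and hybrid_search are hypothetical stand-ins, the cache size of 1024 is arbitrary, and the convention that alpha weights the semantic side (so alpha = 0 means lexical only) is an assumption, since these notes do not say which end is which.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache


def lexical_search(query):
    # Hypothetical stand-in for a lexical engine: returns {doc_id: score}.
    return {"doc1": 1.0, "doc2": 0.5}


def embed(query):
    # Hypothetical stand-in for an (expensive) embedding model call.
    return (float(len(query)),)


@lru_cache(maxsize=1024)  # bounded cache: a repeated query is embedded only once
def cached_embed(query):
    return embed(query)


def semantic_search(query):
    # A real store would use this vector for a nearest-neighbour lookup.
    cached_embed(query)
    return {"doc2": 0.9, "doc3": 0.4}


def hybrid_search(query, alpha=0.5):
    # Assumption: alpha weights the semantic side, (1 - alpha) the lexical side.
    if alpha <= 0.0:
        return lexical_search(query)   # semantic side skipped entirely
    if alpha >= 1.0:
        return semantic_search(query)  # lexical side skipped entirely

    # Both sides are needed, so run them concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        lex_future = pool.submit(lexical_search, query)
        sem_future = pool.submit(semantic_search, query)
        lex, sem = lex_future.result(), sem_future.result()

    # Weighted sum of the two score maps; a missing score counts as 0.
    return {
        doc: (1 - alpha) * lex.get(doc, 0.0) + alpha * sem.get(doc, 0.0)
        for doc in set(lex) | set(sem)
    }
```

In this sketch, calling hybrid_search("q", alpha=0.0) never touches the embedding function, and calling it twice with alpha=0.5 embeds the query only once.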

Breaking / compatibility

  • Requires Python >= 3.9. Python 2 support and the cached-property dependency were removed; the core install now has no third-party runtime dependencies.

Security

  • NumpyVectorStore persistence is now pickle-free (numpy .npz + JSON, loaded with allow_pickle=False), removing the arbitrary-code-execution risk of unpickling a shared/untrusted index. Legacy pickle stores still load with a deprecation warning and migrate on the next save.
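As a rough illustration of the pickle-free approach, the sketch below stores vectors in a numpy .npz archive and the remaining metadata in JSON, then loads the archive with allow_pickle=False. The function names, file names and metadata layout are hypothetical and are not NumpyVectorStore's actual on-disk format.

```python
import json
import os
import tempfile

import numpy as np


def save_store(directory, vectors, doc_ids):
    # Numeric arrays go into a .npz archive; everything else goes into JSON.
    np.savez(os.path.join(directory, "vectors.npz"), vectors=vectors)
    with open(os.path.join(directory, "meta.json"), "w", encoding="utf-8") as fh:
        json.dump({"doc_ids": doc_ids}, fh)


def load_store(directory):
    # With allow_pickle=False numpy refuses to load object arrays,
    # so no pickled payload in the file is ever executed.
    path = os.path.join(directory, "vectors.npz")
    with np.load(path, allow_pickle=False) as data:
        vectors = data["vectors"]
    with open(os.path.join(directory, "meta.json"), encoding="utf-8") as fh:
        meta = json.load(fh)
    return vectors, meta["doc_ids"]


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = np.arange(6, dtype=np.float32).reshape(3, 2)
    save_store(tmp, original, ["a", "b", "c"])
    restored, ids = load_store(tmp)
```

Keeping identifiers and other non-array data in JSON is what makes it possible to refuse pickled content outright when the store is loaded.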

Fixes

  • varint_to_int crashed on Python 3 (ord() on a bytes element).

  • BM25sStore.search: dropped the update_vocab kwarg removed in bm25s 0.3.9, corrected query tokenization, and clamped k to the corpus size.

  • codec/whoosh3.py: changed elif fixedsize is 0 to elif fixedsize == 0.
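The first and last fixes share a root cause in Python semantics. On Python 3, indexing or iterating a bytes object yields ints, so ord() on an element raises TypeError; and "is" tests object identity, so comparing a number to 0 should use ==. The decoder below is a hypothetical sketch of a little-endian base-128 varint reader written with both points in mind. It is not the code of varint_to_int itself, and the byte order is an assumption.

```python
def decode_varint(data):
    # Hypothetical decoder: 7 payload bits per byte, low-order group first,
    # high bit set on every byte except the last.
    value = 0
    shift = 0
    for byte in data:  # iterating bytes yields ints on Python 3; no ord() needed
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:  # compare by value with ==, not identity with "is"
            return value
        shift += 7
    raise ValueError("truncated varint")
```

For example, the two bytes 0xAC 0x02 carry the 7-bit groups 0x2C and 0x02, which this sketch decodes as 0x2C + (0x02 << 7) = 300.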

See Also