semlix 3.1 release notes¶
semlix 3.1.0¶
This release adds high-performance BM25 search and a unified hybrid index that combines lexical and semantic search in a single index. In the benchmarks reported below, search runs 10-100x faster than with the traditional FileStorage backend, and full backward compatibility is maintained.
Major Changes¶
BM25 Index: Ultra-fast BM25-based search implementation using the bm25s library, providing 1000+ queries/second sustained performance with 3x less memory usage than FileStorage.
Unified Index: Combined BM25 + vector search in a single index with transactional writes and automatic embedding generation.
Migration Tools: Automated migration from FileStorage to BM25Index or UnifiedIndex with progress tracking and verification.
Advanced Features: Phrase queries, faceting, multi-field sorting, and field caching for enhanced search capabilities.
New Features¶
BM25 Index¶
semlix.bm25.BM25Index: High-performance index implementing the complete semlix Index protocol as a drop-in replacement for FileIndex.
semlix.bm25.BM25Writer: Fast document indexing with transactional writes and context manager support.
semlix.bm25.BM25Reader: Efficient document access with iteration and lookup capabilities.
semlix.bm25.BM25Searcher: Ultra-fast searcher compatible with HybridSearcher and query parsers.
semlix.stores.BM25sStore: Low-level storage layer using the bm25s library with full Whoosh analyzer integration.
Advanced Search Features¶
semlix.bm25.PhraseQuery: Exact phrase matching with configurable slop (word distance) support.
semlix.bm25.Facets: Powerful faceting capabilities including:
count_by_field(): Count results by field values
range_facet(): Numeric range aggregations
date_facet(): Date/time based faceting
semlix.bm25.SortBy: Multi-field sorting with custom ordering:
by_field(): Sort by single field
by_score(): Sort by relevance score
sort_results(): Multi-field custom sorting
semlix.bm25.FieldCache: LRU cache for frequently accessed field values with automatic invalidation.
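The field cache listed above follows a standard least-recently-used pattern. The sketch below is a generic illustration of that pattern, not the `semlix.bm25.FieldCache` implementation: the class name `LRUFieldCache`, the `loader` callback, and the `invalidate()` method are all hypothetical.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUFieldCache:
    """Hypothetical LRU cache keyed by (doc_id, field_name)."""

    def __init__(self, loader: Callable[[Hashable, str], Any], capacity: int = 1024):
        self._loader = loader            # called on a cache miss
        self._capacity = capacity
        self._entries: OrderedDict = OrderedDict()

    def get(self, doc_id: Hashable, field: str) -> Any:
        key = (doc_id, field)
        if key in self._entries:
            self._entries.move_to_end(key)       # mark as most recently used
            return self._entries[key]
        value = self._loader(doc_id, field)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)    # evict least recently used
        return value

    def invalidate(self) -> None:
        """Drop every entry, for example after a writer commits."""
        self._entries.clear()
```

Clearing the whole cache on commit is the simplest invalidation strategy; a real implementation might invalidate only the documents that changed.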
Unified Index¶
semlix.unified.UnifiedIndex: Combines BM25Index and PgVectorStore for unified hybrid search with:
Automatic embedding generation during indexing
Transactional writes across both lexical and semantic stores
Single-index management interface
ACID guarantees for data consistency
semlix.unified.UnifiedWriter: Transactional writer that maintains both BM25 and vector indexes in sync with automatic rollback on failure.
semlix.unified.UnifiedSearcher: Enhanced hybrid searcher with:
hybrid_search(): Combined BM25 + vector search
search_with_facets(): Hybrid search with faceting
phrase_search(): Phrase queries on hybrid results
search_sorted(): Sorted hybrid search
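To make "transactional writes with automatic rollback" concrete, here is a minimal sketch of the idea using two in-memory dictionaries in place of the BM25 and vector stores. It is not how `UnifiedWriter` is implemented; `TwoStoreWriter`, its constructor arguments, and the snapshot-based rollback are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple


class TwoStoreWriter:
    """Hypothetical writer that keeps two dict-backed stores in sync."""

    def __init__(self, lexical: Dict[str, str], vectors: Dict[str, List[float]],
                 embed: Callable[[str], List[float]]):
        self._lexical = lexical
        self._vectors = vectors
        self._embed = embed
        self._pending: List[Tuple[str, str]] = []

    def add_document(self, doc_id: str, content: str) -> None:
        self._pending.append((doc_id, content))    # staged, not yet visible

    def __enter__(self) -> "TwoStoreWriter":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            self.commit()
        else:
            self._pending.clear()                  # discard staged documents
        return False                               # never swallow exceptions

    def commit(self) -> None:
        # Snapshot both stores so a failure part-way through can be undone.
        lexical_before = dict(self._lexical)
        vectors_before = dict(self._vectors)
        try:
            for doc_id, content in self._pending:
                self._lexical[doc_id] = content
                self._vectors[doc_id] = self._embed(content)
        except Exception:
            self._lexical.clear()
            self._lexical.update(lexical_before)
            self._vectors.clear()
            self._vectors.update(vectors_before)
            raise
        finally:
            self._pending.clear()
```

If the embedding step fails for any staged document, neither store keeps any part of the batch, which is the property the writer description above calls staying "in sync".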
Migration Tools¶
semlix.tools.IndexMigrator: Complete migration toolkit for upgrading from FileStorage to BM25 or Unified indexes.
semlix.tools.migrate_to_bm25(): Simple function for migrating FileStorage to BM25Index with automatic schema preservation and progress tracking.
semlix.tools.migrate_to_unified(): Migrate FileStorage plus vectors to UnifiedIndex with embedding generation and vector store migration.
semlix.tools.migrate_vectors_only(): Migrate from NumpyVectorStore or FaissVectorStore to PgVectorStore.
PostgreSQL Vector Store¶
semlix.semantic.stores.PgVectorStore: Production-ready vector store using PostgreSQL with the pgvector extension, featuring:
HNSW and IVFFlat indexing for fast similarity search
Connection pooling for concurrent access
JSONB metadata filtering
Multiple distance metrics (cosine, L2, inner product)
Transactional integrity with ACID guarantees
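For readers unfamiliar with the three distance metrics named above, the following plain-Python definitions show what each one computes. This is reference arithmetic only, not PgVectorStore code; in pgvector itself the corresponding SQL operators are `<=>` (cosine distance), `<->` (L2 distance), and `<#>` (negative inner product).

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus cosine similarity; 0 for parallel vectors, 1 for orthogonal."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - inner_product(a, b) / norm
```

Cosine distance ignores vector length, so it is a common choice for sentence embeddings; inner product and cosine rank identically when all vectors are normalized to unit length.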
Performance Improvements¶
BM25 Index Performance¶
Compared to FileStorage (10K documents, 384-dim vectors):
Search Speed: 1000+ queries/second (10-100x faster)
Indexing Speed: ~2000 docs/second (6x faster)
Memory Usage: ~100MB (3x less)
Concurrent Queries: Excellent scaling with multi-threading
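Throughput figures like these depend heavily on hardware, corpus, and query mix, so it is worth measuring on your own data. The helper below is a minimal single-threaded sketch; `queries_per_second` and its `run_query` callback are hypothetical names, and the callback is whatever function executes one search in your setup.

```python
import time
from typing import Callable, Iterable


def queries_per_second(run_query: Callable[[str], object],
                       queries: Iterable[str], repeats: int = 5) -> float:
    """Run every query `repeats` times and return the observed throughput."""
    query_list = list(queries)
    if not query_list or repeats < 1:
        raise ValueError("need at least one query and one repeat")
    start = time.perf_counter()
    for _ in range(repeats):
        for text in query_list:
            run_query(text)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return (len(query_list) * repeats) / elapsed
```

Run it against both the old and the new index with the same query list to get a like-for-like comparison.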
Configuration Options¶
BM25 scoring variants available:
lucene: Lucene’s BM25 implementation (default, recommended)
robertson: Robertson’s original BM25 formula
atire: ATIRE search engine variant
bm25l: BM25L with better handling of long documents
bm25+: BM25+ with additional tuning parameter
Tunable parameters:
k1: Term frequency saturation parameter (default: 1.5)
b: Length normalization parameter (default: 0.75)
delta: BM25+ delta parameter (default: 0.5)
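To show what k1 and b actually control, here is a sketch of a per-term score in one common formulation of Lucene-style BM25. The exact formula semlix applies is determined by bm25s and the selected variant and may differ in constant factors; the function name and argument names below are illustrative only.

```python
import math


def bm25_term_score(tf: float, doc_len: float, avg_doc_len: float,
                    n_docs: int, doc_freq: int,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Score one query term against one document (Lucene-style BM25 sketch)."""
    # Rare terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b = 0 disables length normalization; b = 1 applies it fully.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 sets how quickly repeated occurrences stop adding to the score.
    return idf * tf / (tf + k1 * length_norm)
```

With the defaults, a term that occurs once in an average-length document receives 1 / (1 + 1.5) = 0.4 of its IDF weight, and the score approaches the full IDF as the term frequency grows.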
API Changes¶
No breaking changes. All new functionality is additive and opt-in.
New Module Structure¶
semlix.stores: Low-level storage implementations
semlix.stores.BM25sStore: BM25 storage layer
semlix.bm25: High-performance BM25 index
semlix.bm25.BM25Index: Main index class
semlix.bm25.BM25Writer: Document writer
semlix.bm25.BM25Reader: Document reader
semlix.bm25.BM25Searcher: Search interface
Advanced features: PhraseQuery, Facets, SortBy, FieldCache
semlix.unified: Unified hybrid search index
semlix.unified.UnifiedIndex: Combined BM25 + vector index
semlix.unified.UnifiedWriter: Transactional writer
semlix.unified.UnifiedSearcher: Enhanced hybrid searcher
semlix.tools: Migration and utility tools
semlix.tools.IndexMigrator: Migration toolkit
Migration helper functions
Documentation¶
New documentation section BM25 Index covering:
Quick start guide
Complete component reference
Advanced features usage
Configuration and tuning
Performance optimization
Migration from FileStorage
Compatibility information
New documentation section Unified Index covering:
Unified index architecture
Setup and prerequisites
Search modes (hybrid, lexical-only, semantic-only)
Transactional writes
Advanced features integration
Configuration options
Performance characteristics
New documentation section Migration Guide covering:
All migration scenarios
Step-by-step migration guides
Zero-downtime migration strategies
Incremental migration for large indexes
Testing and verification procedures
Data integrity checks
Common issues and solutions
Installation¶
Core BM25 support:
pip install bm25s PyStemmer
For unified index with semantic search:
pip install bm25s PyStemmer sentence-transformers psycopg2-binary pgvector
PostgreSQL setup for unified index:
# Using Docker (recommended)
docker run -d --name pgvector \
    -e POSTGRES_PASSWORD=password \
    -p 5432:5432 \
    ankane/pgvector
# Create extension
psql -d mydb -c "CREATE EXTENSION vector;"
Compatibility¶
Fully backward compatible: All existing code continues to work without modification.
Drop-in replacement: BM25Index implements the complete Index protocol and can be used anywhere FileIndex is used.
Analyzer support: Works with all Whoosh analyzers including StandardAnalyzer, StemmingAnalyzer, and LanguageAnalyzer.
HybridSearcher compatible: BM25Index and UnifiedIndex work directly with the existing HybridSearcher from semlix 3.0.
Schema compatibility: All standard semlix field types are supported (ID, TEXT, KEYWORD, NUMERIC, DATETIME, BOOLEAN).
Limitations¶
Segment management: BM25Index does not expose a direct segment-manipulation API. Segments are handled internally by bm25s, which is sufficient for most use cases.
Real-time updates: BM25sStore rebuilds the index on updates. For applications requiring very frequent small updates, consider batching or using UnifiedIndex.
Custom codecs: FileStorage custom codecs are not supported in BM25Index. Standard field types cover the vast majority of use cases.
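Because each update triggers an index rebuild, the batching suggested under "Real-time updates" can be done with a small buffer in front of the writer. The sketch below is one way to do that; `BatchingBuffer` and its `flush` callback are hypothetical, and in practice the callback would open a writer and add every document in the batch within a single commit.

```python
from typing import Callable, Dict, List


class BatchingBuffer:
    """Collect documents and hand them to `flush` in fixed-size batches."""

    def __init__(self, flush: Callable[[List[Dict]], None], batch_size: int = 1000):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush = flush
        self._batch_size = batch_size
        self._pending: List[Dict] = []

    def add(self, document: Dict) -> None:
        self._pending.append(document)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Write out whatever is buffered; a no-op when the buffer is empty."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush(batch)
```

Call `flush()` once more at shutdown, or on a timer, so that a partially filled batch is not left unindexed.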
Migration Guide¶
From FileStorage to BM25Index¶
Basic migration:
from semlix.tools import migrate_to_bm25

migrate_to_bm25(
    source_dir="old_whoosh_index",
    target_dir="new_bm25_index",
    batch_size=1000,
    verbose=True,
)
From FileStorage to UnifiedIndex¶
Migrate to unified hybrid search:
from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider

embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
migrate_to_unified(
    source_dir="old_index",
    target_dir="new_unified_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="old_vectors.pkl",  # Optional
    batch_size=100,
    verbose=True,
)
Verification after migration:
from semlix.index import open_dir
from semlix.bm25 import open_bm25_index
old_ix = open_dir("old_index")
new_ix = open_bm25_index("new_bm25_index")
# Verify document counts match
assert old_ix.doc_count() == new_ix.doc_count()
# Verify schema is preserved
assert old_ix.schema == new_ix.schema
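Matching document counts and schemas do not prove that every document was copied intact. Assuming the stored fields of both indexes can be read back as dictionaries (how to do that is not covered in these notes), a comparison along the following lines can catch missing or altered documents. `diff_documents` is a hypothetical helper, not part of semlix.tools.

```python
from typing import Dict, Iterable, List, Mapping


def diff_documents(old_docs: Iterable[Mapping], new_docs: Iterable[Mapping],
                   key: str = "id") -> Dict[str, List]:
    """Compare two document collections by a unique key field."""
    old_by_key = {doc[key]: dict(doc) for doc in old_docs}
    new_by_key = {doc[key]: dict(doc) for doc in new_docs}
    shared = old_by_key.keys() & new_by_key.keys()
    return {
        # Present before the migration but absent afterwards.
        "missing": sorted(old_by_key.keys() - new_by_key.keys()),
        # Present only in the migrated index.
        "extra": sorted(new_by_key.keys() - old_by_key.keys()),
        # Same key, different stored field values.
        "changed": sorted(k for k in shared if old_by_key[k] != new_by_key[k]),
    }
```

An empty list under all three headings means the stored fields agree; for very large indexes, running the check on a random sample is a cheaper alternative.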
Usage Examples¶
Basic BM25 Index¶
from semlix.bm25 import create_bm25_index
from semlix.fields import Schema, TEXT, ID
from semlix.qparser import QueryParser

# Create index
schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
ix = create_bm25_index("my_index", schema)

# Index documents
with ix.writer() as writer:
    writer.add_document(id="1", content="Python programming")
    writer.add_document(id="2", content="Database design")

# Search
with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    results = searcher.search(qp.parse("python"), limit=10)
    for hit in results:
        print(f"{hit['id']}: {hit.score:.3f}")
Phrase Queries¶
from semlix.bm25 import PhraseQuery
# Exact phrase
phrase = PhraseQuery("content", ["machine", "learning"], slop=0)
results = phrase.search(ix, limit=10)
# With slop (allow words in between)
phrase = PhraseQuery("content", ["machine", "learning"], slop=2)
results = phrase.search(ix, limit=10)
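The effect of slop is easiest to see on a tokenized document. The function below sketches one plausible interpretation, in which the phrase terms must appear in order with at most `slop` other tokens between each consecutive pair. Whether `PhraseQuery` counts slop per gap or across the whole phrase is not stated in these notes, so treat this as an assumption; `phrase_matches` is a hypothetical helper.

```python
from typing import Sequence


def phrase_matches(tokens: Sequence[str], phrase: Sequence[str], slop: int = 0) -> bool:
    """True if `phrase` occurs in order with at most `slop` extra tokens per gap."""
    if not phrase:
        return False

    def match_from(pos: int, idx: int) -> bool:
        # tokens[pos] has matched phrase[idx]; try to place the remaining terms.
        if idx == len(phrase) - 1:
            return True
        window_end = min(len(tokens), pos + slop + 2)
        return any(
            tokens[nxt] == phrase[idx + 1] and match_from(nxt, idx + 1)
            for nxt in range(pos + 1, window_end)
        )

    return any(tok == phrase[0] and match_from(i, 0) for i, tok in enumerate(tokens))
```

Under this interpretation, "machine and deep learning" matches the phrase ["machine", "learning"] with slop=2 but not with slop=0 or slop=1, and "learning machine" never matches because the order is reversed.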
Faceting¶
from semlix.bm25 import Facets

facets = Facets(ix)
# `query` is a parsed query, e.g. qp.parse("python") from the example above
with ix.searcher() as searcher:
    results = searcher.search(query, limit=100)
    # Count by category
    category_counts = facets.count_by_field(results, "category")
    # Numeric ranges
    ranges = [(0, 100), (100, 500), (500, 1000)]
    price_counts = facets.range_facet(results, "price", ranges)
    # Date facets
    date_counts = facets.date_facet(results, "published", gap="month")
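The ranges in the example share endpoints (100 and 500), so it matters which bucket a boundary value falls into. The sketch below counts with half-open intervals, where the lower bound is included and the upper bound excluded. That convention is an assumption made for illustration, not confirmed behavior of `range_facet()`, and `count_ranges` is a hypothetical helper operating on plain dictionaries.

```python
from typing import Dict, Iterable, Mapping, Sequence, Tuple

Range = Tuple[float, float]


def count_ranges(hits: Iterable[Mapping], field: str,
                 ranges: Sequence[Range]) -> Dict[Range, int]:
    """Count hits per [low, high) bucket; hits without the field are skipped."""
    counts: Dict[Range, int] = {bounds: 0 for bounds in ranges}
    for hit in hits:
        value = hit.get(field)
        if value is None:
            continue
        for low, high in ranges:
            if low <= value < high:
                counts[(low, high)] += 1
                break                     # each hit lands in at most one bucket
    return counts
```

With half-open buckets, a price of exactly 100 is counted in (100, 500) only, and a price of 1000 falls outside every range in the example.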
Unified Index (Hybrid Search)¶
from semlix.unified import create_unified_index
from semlix.semantic import SentenceTransformerProvider

embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
# `schema` is the Schema defined in the Basic BM25 Index example
ix = create_unified_index(
    "my_index",
    schema,
    "postgresql://localhost/mydb",
    embedder,
)

# Index with automatic embeddings
with ix.writer() as writer:
    writer.add_document(id="1", content="Python programming")

with ix.searcher() as searcher:
    # Hybrid search (BM25 + vector)
    results = searcher.hybrid_search("python tutorial", alpha=0.5)
    # With facets
    results, facets = searcher.search_with_facets(
        "python",
        facet_fields=["category"],
    )
    # Phrase search
    results = searcher.phrase_search("content", "machine learning")
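The `alpha` parameter balances the lexical and semantic rankings. These notes do not say how the two score sets are combined, so the sketch below shows one common approach: min-max normalize each set, then take a weighted sum in which `alpha` weights the vector score and `1 - alpha` the BM25 score. Both the normalization and the direction of `alpha` are assumptions, and `fuse_scores` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; identical scores all map to 1.0."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc: 1.0 for doc in scores}
    return {doc: (value - low) / (high - low) for doc, value in scores.items()}


def fuse_scores(bm25: Dict[str, float], vector: Dict[str, float],
                alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Blend two score sets; alpha=0 is lexical only, alpha=1 is semantic only."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    lexical, semantic = _min_max(bm25), _min_max(vector)
    fused = {
        doc: (1.0 - alpha) * lexical.get(doc, 0.0) + alpha * semantic.get(doc, 0.0)
        for doc in lexical.keys() | semantic.keys()
    }
    # Highest score first; ties broken by document id for a stable order.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Normalizing first matters because raw BM25 scores are unbounded while similarity scores typically lie in a fixed range, so an unnormalized sum would let one side dominate regardless of `alpha`.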
Internal Changes¶
New storage abstraction layer in semlix.stores for pluggable backends.
BM25 implementation using the bm25s library for optimal performance.
Enhanced index protocol with transactional guarantees in UnifiedIndex.
Migration tools with comprehensive error handling and verification.
Connection pooling in PgVectorStore for concurrent access.
LRU caching for field access optimization.
Dependencies¶
New optional dependencies:
bm25s: Required for BM25Index (pip install bm25s)
PyStemmer: Required for stemming support (pip install PyStemmer)
psycopg2-binary: Required for PgVectorStore (pip install psycopg2-binary)
pgvector: Required for PgVectorStore (pip install pgvector)
Performance Notes¶
BM25Index is optimized for:
Read-heavy workloads: 1000+ queries/second sustained
Batch indexing: ~2000 documents/second
Memory efficiency: 3x less memory than FileStorage
Concurrent queries: Excellent multi-threading performance
UnifiedIndex is ideal for:
Hybrid search applications: Single index for lexical + semantic
Transactional integrity: ACID guarantees across both stores
Production deployments: PostgreSQL reliability and scaling
Advanced features: Faceting, sorting, phrase queries on hybrid results
For best performance:
Use batch indexing with 1000+ documents per commit
Call optimize() after bulk indexing operations
Use field caching for frequently accessed fields
Enable PostgreSQL HNSW indexing for vector search
Use an appropriate alpha parameter for hybrid search (0.5 recommended)
Future Plans¶
Extended test coverage with performance benchmarks
Additional BM25 variants and tuning options
Enhanced migration tools with parallel processing
Production deployment guides and monitoring tools
High availability and replication support
See Also¶
BM25 Index - Complete BM25Index documentation
Unified Index - UnifiedIndex and hybrid search
Migration Guide - Detailed migration guides
Semantic Search - Semantic search overview from 3.0