BM25 Index
semlix includes a high-performance BM25-based index implementation that searches 10-100x faster than the traditional FileStorage backend while remaining fully compatible with the existing Index API.
Overview
The BM25 module is a complete alternative to FileStorage, built on the bm25s library for fast lexical search. It implements the full semlix Index protocol, so it works as a drop-in replacement with significant performance improvements.
Key Benefits:
10-100x faster search: 1000+ queries/second sustained performance
Lower memory usage: 3x less memory than FileStorage
Full compatibility: Implements complete Index protocol
Advanced features: Phrase queries, faceting, sorting, field caching
Easy migration: Automated tools for upgrading from FileStorage
Quick Start
Creating a BM25 Index
Basic index creation:
from semlix.bm25 import create_bm25_index
from semlix.fields import Schema, TEXT, ID, KEYWORD
from semlix.analysis import StandardAnalyzer
schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True, analyzer=StandardAnalyzer()),
    category=KEYWORD(stored=True)
)
ix = create_bm25_index("my_bm25_index", schema)
Indexing Documents
Use the standard writer interface:
with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="Introduction to Python",
        content="Python is a high-level programming language...",
        category="tutorial"
    )
    writer.add_document(
        id="2",
        title="Advanced Python Techniques",
        content="Learn decorators, generators, and metaclasses...",
        category="advanced"
    )
Searching
Use the standard searcher interface:
from semlix.qparser import QueryParser
with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    query = qp.parse("python programming")
    results = searcher.search(query, limit=10)
    for hit in results:
        print(f"{hit['title']}: {hit.score:.3f}")
Opening an Existing Index
from semlix.bm25 import open_bm25_index
ix = open_bm25_index("my_bm25_index")
Components
BM25Index
The main index class that implements the complete semlix Index protocol.
Key Methods:
writer(**kwargs): Returns a BM25Writer for indexing
searcher(**kwargs): Returns a BM25Searcher for searching
reader(**kwargs): Returns a BM25Reader for document access
optimize(): Rebuilds index for optimal performance
doc_count(): Returns number of indexed documents
close(): Closes the index and frees resources
Properties:
schema: The index schema
index_dir: Directory containing index files
BM25Writer
Handles document indexing operations:
from semlix.qparser import QueryParser
with ix.writer() as writer:
    # Add new document
    writer.add_document(id="1", content="Document text")
    # Update existing document
    writer.update_document(id="1", content="Updated text")
    # Delete document
    writer.delete_document(id="1")
    # Delete by query
    qp = QueryParser("content", ix.schema)
    query = qp.parse("obsolete")
    writer.delete_by_query(query)
The writer works as a context manager: changes are committed automatically when the block exits normally and rolled back if it raises.
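The commit/rollback behaviour of a context-managed writer can be sketched in plain Python. This is only an illustration of the pattern, not the BM25Writer source; the BufferedWriter class and its list-backed store are hypothetical stand-ins.

```python
class BufferedWriter:
    """Toy writer: buffers documents, commits on success, rolls back on error."""

    def __init__(self, store):
        self.store = store      # committed documents (a plain list in this sketch)
        self.pending = []       # documents added but not yet committed

    def add_document(self, **fields):
        self.pending.append(fields)

    def commit(self):
        self.store.extend(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.commit()
        else:
            self.rollback()
        return False            # never swallow the exception


store = []
with BufferedWriter(store) as writer:
    writer.add_document(id="1", content="kept")

try:
    with BufferedWriter(store) as writer:
        writer.add_document(id="2", content="discarded")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass
```

After both blocks run, only the first document is in `store`, because the second block exited with an exception.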
BM25Reader
Provides read access to indexed documents:
with ix.reader() as reader:
    # Get document count
    count = reader.doc_count()
    # Get stored fields by document number
    fields = reader.stored_fields(0)
    # Get document number by ID
    docnum = reader.document_number(id="1")
    # Iterate all documents
    for doc in reader.iter_docs():
        print(doc)
BM25Searcher
Executes searches and retrieves results:
with ix.searcher() as searcher:
    # Basic search (query is a parsed query, e.g. from QueryParser.parse())
    results = searcher.search(query, limit=10)
    # Paginated search
    results = searcher.search_page(query, pagenum=2, pagelen=10)
    # Find document by ID
    docnum = searcher.document_number(id="1")
    # Get stored fields for that document
    fields = searcher.stored_fields(docnum)
The searcher is fully compatible with QueryParser and HybridSearcher.
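As a rough guide to what pagenum and pagelen select, the following sketch computes the slice of the ranked result list that a page covers. It assumes page numbers start at 1, which the text above does not state; page_bounds is a hypothetical helper, not a semlix function.

```python
def page_bounds(pagenum, pagelen, total):
    """Return (start, end) offsets of a page, assuming 1-based page numbers."""
    if pagenum < 1 or pagelen < 1:
        raise ValueError("pagenum and pagelen must be >= 1")
    start = min((pagenum - 1) * pagelen, total)
    end = min(start + pagelen, total)
    return start, end


# Page 2 with 10 hits per page, out of 35 total hits
bounds = page_bounds(2, 10, 35)
```

Under that assumption, pagenum=2 with pagelen=10 corresponds to hits at offsets 10 through 19.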
Advanced Features
Phrase Queries
Search for exact phrases with optional word distance (slop):
from semlix.bm25 import PhraseQuery
# Exact phrase
phrase_query = PhraseQuery(
    field="content",
    words=["machine", "learning"],
    slop=0
)
results = phrase_query.search(ix, limit=10)
# With slop (allows words in between)
phrase_query = PhraseQuery(
    field="content",
    words=["machine", "learning"],
    slop=2  # Allows up to 2 words between
)
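To make the meaning of slop concrete, here is a minimal sketch of ordered phrase matching over a token list. It assumes slop counts the total number of extra tokens allowed between the phrase words and that word order must be preserved; PhraseQuery may define these details differently, and phrase_matches is a hypothetical helper.

```python
def phrase_matches(tokens, words, slop=0):
    """True if `words` occur in order with at most `slop` extra tokens in between."""
    if not words:
        return False
    for start, token in enumerate(tokens):
        if token != words[0]:
            continue
        pos = start
        found = True
        # Greedily take the earliest later occurrence of each following word
        for word in words[1:]:
            try:
                pos = tokens.index(word, pos + 1)
            except ValueError:
                found = False
                break
        if found:
            extra = (pos - start) - (len(words) - 1)
            if extra <= slop:
                return True
    return False


tokens = "machine and deep learning".split()
exact = phrase_matches(tokens, ["machine", "learning"], slop=0)
loose = phrase_matches(tokens, ["machine", "learning"], slop=2)
```

Here two tokens separate the phrase words, so the exact match fails and the slop=2 match succeeds.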
Faceting
Compute aggregations over search results:
from semlix.bm25 import Facets
facets = Facets(ix)
with ix.searcher() as searcher:
    results = searcher.search(query, limit=100)
    # Count by category
    category_counts = facets.count_by_field(results, "category")
    # {"tutorial": 45, "advanced": 32, "reference": 23}
    # Numeric range facets
    ranges = [(0, 100), (100, 500), (500, 1000)]
    range_counts = facets.range_facet(results, "price", ranges)
    # Date facets
    date_counts = facets.date_facet(results, "published", gap="month")
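Conceptually, a facet is a grouped count over the hits. The sketch below shows the idea on plain dictionaries; it is not the Facets implementation, and it assumes ranges are half-open intervals [low, high), which the shared boundaries in the example above suggest but do not confirm.

```python
from collections import Counter


def count_by_field(hits, field):
    """Count hits per distinct value of `field`, skipping hits without it."""
    return dict(Counter(hit[field] for hit in hits if field in hit))


def range_facet(hits, field, ranges):
    """Count hits whose `field` value falls in each half-open range [low, high)."""
    counts = {r: 0 for r in ranges}
    for hit in hits:
        value = hit.get(field)
        if value is None:
            continue
        for low, high in ranges:
            if low <= value < high:
                counts[(low, high)] += 1
                break
    return counts


hits = [
    {"category": "tutorial", "price": 50},
    {"category": "advanced", "price": 100},
    {"category": "tutorial", "price": 999},
    {"category": "reference"},
]
by_category = count_by_field(hits, "category")
by_price = range_facet(hits, "price", [(0, 100), (100, 500), (500, 1000)])
```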
Sorting
Sort results by multiple fields:
from semlix.bm25 import SortBy
# Sort by date descending, then score
sorter = SortBy([("published", True), ("score", True)])
sorted_results = sorter.sort_results(results)
# Convenience methods
sorted_by_field = SortBy.by_field(results, "title")
sorted_by_score = SortBy.by_score(results, reverse=True)
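Multi-field sorting with a per-field direction can be sketched with Python's stable sort: apply the keys from least to most significant. This is an illustration on plain dictionaries, not the SortBy source; as in the example above, True in each pair is taken to mean descending.

```python
from operator import itemgetter


def sort_by(rows, keys):
    """Sort rows by (field, reverse) pairs, most significant pair first."""
    ordered = list(rows)
    # Stable sorts applied in reverse key order preserve earlier tie-breaks
    for field, reverse in reversed(keys):
        ordered.sort(key=itemgetter(field), reverse=reverse)
    return ordered


rows = [
    {"id": "a", "published": 2023, "score": 1.0},
    {"id": "b", "published": 2024, "score": 0.5},
    {"id": "c", "published": 2024, "score": 2.0},
]
ranked = sort_by(rows, [("published", True), ("score", True)])
ranked_ids = [row["id"] for row in ranked]
```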
Field Caching
Cache frequently accessed field values for better performance:
from semlix.bm25 import FieldCache
cache = FieldCache(ix, max_size=1000)
# Cache a field for all documents
cache.cache_field("title")
# Get cached value (very fast)
title = cache.get_cached("doc123", "title")
# Invalidate cache when documents change
cache.invalidate("doc123") # Single document
cache.invalidate() # All documents
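A bounded cache with per-document invalidation might look like the sketch below. The eviction policy of FieldCache is not documented here, so least-recently-used eviction is an assumption, and LRUFieldCache is a hypothetical class written only for illustration.

```python
from collections import OrderedDict


class LRUFieldCache:
    """Toy cache keyed by (doc_id, field) with least-recently-used eviction."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._data = OrderedDict()

    def put(self, doc_id, field, value):
        key = (doc_id, field)
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self.max_size:
            self._data.popitem(last=False)   # drop the oldest entry

    def get(self, doc_id, field, default=None):
        key = (doc_id, field)
        if key not in self._data:
            return default
        self._data.move_to_end(key)          # mark as recently used
        return self._data[key]

    def invalidate(self, doc_id=None):
        if doc_id is None:
            self._data.clear()
            return
        for key in [k for k in self._data if k[0] == doc_id]:
            del self._data[key]


cache = LRUFieldCache(max_size=2)
cache.put("d1", "title", "A")
cache.put("d2", "title", "B")
cache.get("d1", "title")          # d1 becomes most recently used
cache.put("d3", "title", "C")     # evicts d2
```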
Configuration
BM25 Parameters
You can tune BM25 scoring parameters:
from semlix.stores import BM25sStore
store = BM25sStore.create(
    index_dir="my_index",
    method="lucene",  # or "robertson", "atire", "bm25l", "bm25+"
    k1=1.5,  # Term frequency saturation (default: 1.5)
    b=0.75,  # Length normalization (default: 0.75)
    delta=0.5  # BM25+ delta parameter (default: 0.5)
)
BM25 Variants:
lucene: Lucene’s BM25 implementation (default, recommended)
robertson: Robertson’s original BM25
atire: ATIRE variant
bm25l: BM25L with better handling of long documents
bm25+: BM25+ with additional tuning parameter
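To see what k1 and b control, the sketch below scores a single term with the textbook BM25 formula and a non-negative IDF. The listed variants differ in their IDF and normalization details, so this is an illustration of the parameters' roles rather than the exact formula used by any one method; bm25_term_score is a hypothetical helper.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Textbook BM25 contribution of one query term to one document."""
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b scales how strongly document length penalizes the score
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding score
    return idf * tf * (k1 + 1.0) / (tf + norm)


once = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
twice = bm25_term_score(tf=2, doc_len=100, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_term_score(tf=2, doc_len=200, avg_doc_len=100, n_docs=1000, doc_freq=10)
```

A second occurrence raises the score by less than the first did (saturation), and the same term frequency scores lower in a longer document unless b is set to 0.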
Analyzers
BM25Index works with all semlix analyzers:
from semlix.fields import Schema, TEXT
from semlix.analysis import StandardAnalyzer, StemmingAnalyzer, LanguageAnalyzer
# Standard analyzer (tokenize, lowercase, stopwords)
schema = Schema(
    content=TEXT(analyzer=StandardAnalyzer())
)
# With stemming
schema = Schema(
    content=TEXT(analyzer=StemmingAnalyzer())
)
# Language-specific
schema = Schema(
    content=TEXT(analyzer=LanguageAnalyzer("spanish"))
)
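The three steps named for the standard analyzer (tokenize, lowercase, remove stop words) can be sketched as follows. The regular expression and the tiny stop-word list are assumptions chosen for illustration; the real StandardAnalyzer may tokenize and filter differently.

```python
import re

# Hypothetical stop-word list for this sketch; real analyzers ship their own
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def simple_analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words]


terms = simple_analyze("Python is a high-level programming language")
```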
Performance Tuning
Indexing Performance
Batch Size:
Add documents in batches for best performance:
with ix.writer() as writer:
    batch = []
    for doc in documents:
        batch.append(doc)
        if len(batch) >= 1000:
            for doc_fields in batch:
                writer.add_document(**doc_fields)
            batch = []
    # Flush the final partial batch
    for doc_fields in batch:
        writer.add_document(**doc_fields)
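The batching logic can also be factored into a small generator so that the last, partial batch is never forgotten. This is a generic standard-library sketch; batched is a hypothetical helper name, not part of semlix.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items, including a final partial batch."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


chunks = list(batched(range(5), 2))
```

Each yielded batch could then be passed to the writer in one inner loop.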
Optimization:
Rebuild the index after bulk operations:
ix.optimize() # Rebuilds index for optimal performance
Search Performance
Memory Mapping:
For large indexes, use memory-mapped files:
from semlix.stores import BM25sStore
# When loading
store = BM25sStore.load(index_dir, mmap=True)
This reduces memory usage and improves cache efficiency.
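The general mechanism behind memory mapping can be shown with the standard-library mmap module: the file is mapped into the address space and only the pages that are actually touched get read. This sketch illustrates the operating-system feature, not how BM25sStore lays out its files; the file name and helper functions are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def write_doubles(path, values):
    """Write a list of floats as little-endian 8-byte doubles."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}d", *values))


def read_double(path, index):
    """Read one double through a read-only memory map, without loading the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from("<d", mm, index * 8)[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scores.bin")
    write_doubles(path, [0.5, 1.25, 3.0])
    second = read_double(path, 1)
```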
Field Caching:
Cache frequently accessed fields:
cache = FieldCache(ix, max_size=10000)
cache.cache_field("title")
cache.cache_field("category")
Migration
From FileStorage
Migrate an existing FileStorage index to BM25:
from semlix.tools import migrate_to_bm25
migrate_to_bm25(
    source_dir="old_whoosh_index",
    target_dir="new_bm25_index",
    batch_size=1000
)
The migration process:
1. Opens the source index
2. Creates a new BM25 index with the same schema
3. Copies all documents with progress tracking
4. Optimizes the new index
Custom Migration:
For more control, use IndexMigrator:
from semlix.tools import IndexMigrator
from semlix.index import open_dir
from semlix.bm25 import create_bm25_index
migrator = IndexMigrator(verbose=True)
source = open_dir("old_index")
target = create_bm25_index("new_index", source.schema)
with source.searcher() as searcher:
    with target.writer() as writer:
        for docnum in range(searcher.reader().doc_count_all()):
            fields = searcher.stored_fields(docnum)
            # Optional: filter documents during migration
            # (should_migrate is a predicate you define yourself)
            if should_migrate(fields):
                writer.add_document(**fields)
Compatibility
Index Protocol
BM25Index implements the complete semlix Index protocol:
✅ writer()/reader()/searcher()
✅ optimize()/doc_count()/is_empty()
✅ add_field()/remove_field()
✅ latest_generation()/refresh()
✅ Schema management
✅ Context managers
This means BM25Index is a drop-in replacement for FileIndex.
HybridSearcher
Works directly with HybridSearcher for semantic search:
from semlix.bm25 import open_bm25_index
from semlix.semantic import HybridSearcher, SentenceTransformerProvider
from semlix.semantic.stores import PgVectorStore
ix = open_bm25_index("my_index")
embedder = SentenceTransformerProvider()
vectors = PgVectorStore("postgresql://localhost/mydb", dimension=384)
searcher = HybridSearcher(ix, vectors, embedder, alpha=0.5)
results = searcher.search("query text", limit=10)
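One common way to combine lexical and semantic rankings is a weighted sum of normalized scores. The sketch below assumes that alpha weights the semantic score and (1 - alpha) the lexical score, and that scores are min-max normalized first; neither detail is confirmed for HybridSearcher, so treat this purely as an illustration of what an alpha blend can mean. The function names are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(lexical, semantic, alpha=0.5):
    """Blend two score maps; alpha is assumed to weight the semantic side."""
    lex, sem = min_max(lexical), min_max(semantic)
    docs = set(lex) | set(sem)
    fused = {
        d: alpha * sem.get(d, 0.0) + (1.0 - alpha) * lex.get(d, 0.0)
        for d in docs
    }
    # Highest score first; ties broken by document id for determinism
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


lexical = {"a": 10.0, "b": 5.0}
semantic = {"b": 0.9, "c": 0.1}
balanced = fuse(lexical, semantic, alpha=0.5)
semantic_only = fuse(lexical, semantic, alpha=1.0)
```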
Limitations
Partial Implementation
Segment Management:
Unlike FileStorage, BM25Index does not expose a direct segment-management API. Segments are handled internally by bm25s, which is sufficient for most use cases.
Real-time Updates:
BM25sStore rebuilds the entire index on updates. For applications that require frequent small updates, consider batching them or using UnifiedIndex.
Not Implemented
The following FileStorage features are not implemented:
Direct segment access/manipulation
Custom codecs
Per-segment optimization
Incremental updates without rebuild
These features are rarely needed, and for most use cases the performance benefits of BM25 outweigh these limitations.
Examples
Basic Usage
from semlix.bm25 import create_bm25_index, open_bm25_index
from semlix.fields import Schema, TEXT, ID
from semlix.qparser import QueryParser
# Create
schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
ix = create_bm25_index("my_index", schema)
# Index
with ix.writer() as writer:
    writer.add_document(id="1", content="Python programming")
    writer.add_document(id="2", content="Database design")
ix.close()
# Open and search
ix = open_bm25_index("my_index")
with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    results = searcher.search(qp.parse("python"), limit=10)
    for hit in results:
        print(f"{hit['id']}: {hit.score:.3f}")
With Advanced Features
from semlix.bm25 import (
    create_bm25_index,
    PhraseQuery,
    Facets,
    SortBy
)
from semlix.qparser import QueryParser
# schema is defined as in the Quick Start section
ix = create_bm25_index("my_index", schema)
# ... index documents ...
with ix.searcher() as searcher:
    # Phrase search
    pq = PhraseQuery("content", ["machine", "learning"])
    results = pq.search(ix, limit=10)
    # Faceting
    facets = Facets(ix)
    qp = QueryParser("content", ix.schema)
    results = searcher.search(qp.parse("python"), limit=100)
    counts = facets.count_by_field(results, "category")
    # Sorting
    sorter = SortBy([("date", True), ("score", True)])
    sorted_results = sorter.sort_results(results)
Performance Comparison
Benchmarks (10K documents, 384-dim vectors):
| Metric | FileStorage | BM25Index | Improvement |
|---|---|---|---|
| Search Speed | 10-100 q/s | 1000+ q/s | 10-100x |
| Index Build Time | ~30s | ~5s | 6x faster |
| Memory Usage | 300MB | 100MB | 3x less |
| Concurrent Queries | Limited | Excellent | Much better |
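Throughput figures like the queries-per-second numbers above depend on hardware and data, so it can be useful to measure them on your own index. The helper below is a generic timing sketch and not the harness used for the published numbers; measure_qps and its run_query callable are hypothetical, and in practice run_query would wrap a call such as searcher.search.

```python
import time


def measure_qps(run_query, queries, repeats=3):
    """Return queries per second, using the fastest of `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            run_query(q)
        best = min(best, time.perf_counter() - start)
    return len(queries) / best if best > 0 else float("inf")


# Stand-in workload; replace the lambda with a real search call
qps = measure_qps(lambda q: sum(range(1000)), ["python", "database", "index"])
```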
See Also
Unified Index - Unified index combining BM25 and vector search
Semantic Search - Semantic search and HybridSearcher
How to index documents - General indexing concepts
How to search - Search query syntax