BM25 Index¶

semlix now includes a BM25-based index implementation that, in the benchmarks reported under Performance Comparison, searches 10-100x faster than the traditional FileStorage index while implementing the existing Index API.

Overview¶

The BM25 module provides a complete alternative to FileStorage using the bm25s library for ultra-fast lexical search. It implements the full semlix Index protocol, making it a drop-in replacement with significant performance improvements.

Key Benefits:

10-100x faster search: 1000+ queries/second sustained performance
Lower memory usage: 3x less memory than FileStorage
Full compatibility: Implements complete Index protocol
Advanced features: Phrase queries, faceting, sorting, field caching
Easy migration: Automated tools for upgrading from FileStorage

Quick Start¶

Creating a BM25 Index¶

Basic index creation:

from semlix.bm25 import create_bm25_index
from semlix.fields import Schema, TEXT, ID, KEYWORD
from semlix.analysis import StandardAnalyzer

schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True, analyzer=StandardAnalyzer()),
    category=KEYWORD(stored=True)
)

ix = create_bm25_index("my_bm25_index", schema)

Indexing Documents¶

Use the standard writer interface:

with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="Introduction to Python",
        content="Python is a high-level programming language...",
        category="tutorial"
    )
    writer.add_document(
        id="2",
        title="Advanced Python Techniques",
        content="Learn decorators, generators, and metaclasses...",
        category="advanced"
    )

Searching¶

Use the standard searcher interface:

from semlix.qparser import QueryParser

with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    query = qp.parse("python programming")

    results = searcher.search(query, limit=10)

    for hit in results:
        print(f"{hit['title']}: {hit.score:.3f}")

Opening an Existing Index¶

from semlix.bm25 import open_bm25_index

ix = open_bm25_index("my_bm25_index")

Components¶

BM25Index¶

The main index class that implements the complete semlix Index protocol.

Key Methods:

writer(**kwargs): Returns a BM25Writer for indexing
searcher(**kwargs): Returns a BM25Searcher for searching
reader(**kwargs): Returns a BM25Reader for document access
optimize(): Rebuilds index for optimal performance
doc_count(): Returns number of indexed documents
close(): Closes the index and frees resources

Properties:

schema: The index schema
index_dir: Directory containing index files

BM25Writer¶

Handles document indexing operations:

from semlix.qparser import QueryParser

with ix.writer() as writer:
    # Add a new document
    writer.add_document(id="1", content="Document text")

    # Update an existing document
    writer.update_document(id="1", content="Updated text")

    # Delete a document
    writer.delete_document(id="1")

    # Delete every document matching a query
    qp = QueryParser("content", ix.schema)
    query = qp.parse("obsolete")
    writer.delete_by_query(query)

The writer can be used as a context manager: changes are committed when the block exits normally and rolled back if an exception is raised.

BM25Reader¶

Provides read access to indexed documents:

with ix.reader() as reader:
    # Get document count
    count = reader.doc_count()

    # Get stored fields by document number
    fields = reader.stored_fields(0)

    # Get document number by ID
    docnum = reader.document_number(id="1")

    # Iterate all documents
    for doc in reader.iter_docs():
        print(doc)

BM25Searcher¶

Executes searches and retrieves results:

from semlix.qparser import QueryParser

query = QueryParser("content", ix.schema).parse("python")

with ix.searcher() as searcher:
    # Basic search
    results = searcher.search(query, limit=10)

    # Paginated search
    results = searcher.search_page(query, pagenum=2, pagelen=10)

    # Find document number by ID
    docnum = searcher.document_number(id="1")

    # Get stored fields for that document
    fields = searcher.stored_fields(docnum)

The searcher is fully compatible with QueryParser and HybridSearcher.

Advanced Features¶

Phrase Queries¶

Search for exact phrases with optional word distance (slop):

from semlix.bm25 import PhraseQuery

# Exact phrase
phrase_query = PhraseQuery(
    field="content",
    words=["machine", "learning"],
    slop=0
)

results = phrase_query.search(ix, limit=10)

# With slop (allows words in between)
phrase_query = PhraseQuery(
    field="content",
    words=["machine", "learning"],
    slop=2  # Allows up to 2 words between
)
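
To make the slop parameter concrete, here is a minimal, library-independent sketch of ordered phrase matching over a token list. It is not the semlix PhraseQuery implementation; `phrase_matches` is a hypothetical helper, and it assumes that slop means the total number of extra tokens allowed between the first and last phrase word.

```python
def phrase_matches(tokens, words, slop=0):
    """Return True if `words` occur in order in `tokens` with at most
    `slop` extra tokens in total between the first and last phrase word."""
    if not words:
        return False

    def extend(word_idx, prev_pos, first_pos):
        # All phrase words placed: the phrase matches.
        if word_idx == len(words):
            return True
        # Latest position that keeps the total number of extra tokens <= slop
        last_allowed = min(first_pos + word_idx + slop, len(tokens) - 1)
        for pos in range(prev_pos + 1, last_allowed + 1):
            if tokens[pos] == words[word_idx] and extend(word_idx + 1, pos, first_pos):
                return True
        return False

    return any(
        tokens[start] == words[0] and extend(1, start, start)
        for start in range(len(tokens))
    )


tokens = "machine assisted deep learning".split()
exact = phrase_matches(tokens, ["machine", "learning"], slop=0)
loose = phrase_matches(tokens, ["machine", "learning"], slop=2)
print(exact, loose)  # False True
```

With slop=0 the words must be adjacent; with slop=2 the two intervening tokens are tolerated.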

Faceting¶

Compute aggregations over search results:

from semlix.bm25 import Facets

facets = Facets(ix)

with ix.searcher() as searcher:
    # `query` is a parsed query, e.g. qp.parse("python programming")
    results = searcher.search(query, limit=100)

    # Count by category
    category_counts = facets.count_by_field(results, "category")
    # e.g. {"tutorial": 45, "advanced": 32, "reference": 23}

    # Numeric range facets (requires a numeric "price" field in the schema)
    ranges = [(0, 100), (100, 500), (500, 1000)]
    range_counts = facets.range_facet(results, "price", ranges)

    # Date facets (requires a date "published" field in the schema)
    date_counts = facets.date_facet(results, "published", gap="month")
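
The idea behind value and range facets can be sketched with plain dictionaries standing in for hits. This is an illustration only, not how Facets is implemented; the function names mirror the ones above for readability, and the half-open range boundaries [low, high) are an assumption.

```python
from collections import Counter


def count_by_field(hits, field):
    """Count hits per distinct value of `field`, skipping hits without it."""
    return dict(Counter(hit[field] for hit in hits if field in hit))


def range_facet(hits, field, ranges):
    """Count hits whose numeric `field` falls in each half-open [low, high) range."""
    counts = {r: 0 for r in ranges}
    for hit in hits:
        value = hit.get(field)
        if value is None:
            continue
        for low, high in ranges:
            if low <= value < high:
                counts[(low, high)] += 1
                break  # each hit is counted in at most one range
    return counts


hits = [
    {"category": "tutorial", "price": 50},
    {"category": "tutorial", "price": 100},
    {"category": "advanced", "price": 750},
]
category_counts = count_by_field(hits, "category")
range_counts = range_facet(hits, "price", [(0, 100), (100, 500), (500, 1000)])
```

Here a price of exactly 100 lands in the (100, 500) bucket because the lower bound is inclusive and the upper bound exclusive.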

Sorting¶

Sort results by multiple fields:

from semlix.bm25 import SortBy

# Sort by date descending, then by score descending (True = descending)
sorter = SortBy([("published", True), ("score", True)])
sorted_results = sorter.sort_results(results)

# Convenience methods
sorted_by_field = SortBy.by_field(results, "title")
sorted_by_score = SortBy.by_score(results, reverse=True)
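
Multi-field sorting with a per-field direction can be sketched in plain Python by relying on sort stability. This is a hypothetical helper (`sort_by`) over dictionaries, not the SortBy implementation; it only assumes the same (field, descending) pair convention used above.

```python
from operator import itemgetter


def sort_by(hits, keys):
    """Sort dict-like hits by several (field, descending) pairs.

    Python's sort is stable, so sorting by the least significant key first
    and the most significant key last yields a correct multi-key ordering.
    """
    ordered = list(hits)
    for field, descending in reversed(keys):
        ordered.sort(key=itemgetter(field), reverse=descending)
    return ordered


hits = [
    {"id": "a", "published": "2024-01-10", "score": 1.2},
    {"id": "b", "published": "2024-03-05", "score": 0.7},
    {"id": "c", "published": "2024-03-05", "score": 2.1},
]
ordered = sort_by(hits, [("published", True), ("score", True)])
print([hit["id"] for hit in ordered])  # ['c', 'b', 'a']
```

Documents "b" and "c" share a date, so the score decides their relative order.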

Field Caching¶

Cache frequently accessed field values for better performance:

from semlix.bm25 import FieldCache

cache = FieldCache(ix, max_size=1000)

# Cache a field for all documents
cache.cache_field("title")

# Get cached value (very fast)
title = cache.get_cached("doc123", "title")

# Invalidate cache when documents change
cache.invalidate("doc123")  # Single document
cache.invalidate()          # All documents
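
For readers curious how a bounded field cache with invalidation might behave, here is a minimal least-recently-used sketch built on the standard library. `LRUFieldCache` and its `loader` callback are hypothetical; the eviction policy that FieldCache actually uses when `max_size` is reached is not documented here.

```python
from collections import OrderedDict


class LRUFieldCache:
    """Bounded cache of field values keyed by (doc_id, field)."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._entries = OrderedDict()  # (doc_id, field) -> value

    def get(self, doc_id, field, loader):
        """Return the cached value, calling loader(doc_id, field) on a miss."""
        key = (doc_id, field)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = loader(doc_id, field)
        self._entries[key] = value
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used
        return value

    def invalidate(self, doc_id=None):
        """Drop entries for one document, or everything if doc_id is None."""
        if doc_id is None:
            self._entries.clear()
        else:
            for key in [k for k in self._entries if k[0] == doc_id]:
                del self._entries[key]
```

In practice the loader would read the stored field from the index, so repeated lookups for hot documents avoid that read.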

Configuration¶

BM25 Parameters¶

You can tune BM25 scoring parameters:

from semlix.stores import BM25sStore

store = BM25sStore.create(
    index_dir="my_index",
    method="lucene",    # or "robertson", "atire", "bm25l", "bm25+"
    k1=1.5,            # Term frequency saturation (default: 1.5)
    b=0.75,            # Length normalization (default: 0.75)
    delta=0.5          # BM25+ delta parameter (default: 0.5)
)

BM25 Variants:

lucene: Lucene’s BM25 implementation (default, recommended)
robertson: Robertson’s original BM25
atire: ATIRE variant
bm25l: BM25L with better handling of long documents
bm25+: BM25+, which adds a lower-bound parameter (delta) to the term-frequency component
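
To show what k1 and b control, here is a small standalone sketch of the Lucene-style BM25 score for a single query term, using the formula as commonly published. It is not taken from the semlix or bm25s source, and `bm25_lucene_score` is a hypothetical name.

```python
import math


def bm25_lucene_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Score one query term against one document (Lucene-style BM25).

    tf: occurrences of the term in the document
    doc_freq: number of documents that contain the term
    """
    # IDF: rarer terms get more weight; the "1 +" keeps it non-negative.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Saturation: the tf part approaches 1 as tf grows; k1 sets how slowly.
    return idf * tf / (tf + k1 * norm)


short_doc = bm25_lucene_score(tf=2, doc_len=50, avg_doc_len=100, n_docs=1000, doc_freq=10)
long_doc = bm25_lucene_score(tf=2, doc_len=400, avg_doc_len=100, n_docs=1000, doc_freq=10)
print(short_doc > long_doc)  # True
```

With the same term frequency, the shorter document scores higher because b=0.75 penalizes documents longer than average.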

Analyzers¶

BM25Index works with all semlix analyzers:

from semlix.analysis import StandardAnalyzer, StemmingAnalyzer, LanguageAnalyzer

# Standard analyzer (tokenize, lowercase, stopwords)
schema = Schema(
    content=TEXT(analyzer=StandardAnalyzer())
)

# With stemming
schema = Schema(
    content=TEXT(analyzer=StemmingAnalyzer())
)

# Language-specific
schema = Schema(
    content=TEXT(analyzer=LanguageAnalyzer("spanish"))
)

Performance Tuning¶

Indexing Performance¶

Batch Size:

Add documents in batches for best performance:

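with ix.writer() as writer:
    batch = []
    for doc in documents:
        batch.append(doc)

        if len(batch) >= 1000:
            for doc_fields in batch:
                writer.add_document(**doc_fields)
            batch = []

    # Flush the final partial batch so no documents are dropped
    for doc_fields in batch:
        writer.add_document(**doc_fields)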
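
A reusable way to split any iterable into fixed-size batches, including the final partial one, is a small generator. The sketch below uses only the standard library; `batched` is a hypothetical helper name, not part of semlix.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items; the last list may be shorter."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

Each yielded batch could then be passed to a loop that calls writer.add_document(**doc_fields) for every item.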

Optimization:

Rebuild the index after bulk operations:

ix.optimize()  # Rebuilds index for optimal performance

Search Performance¶

Memory Mapping:

For large indexes, use memory-mapped files:

from semlix.stores import BM25sStore

# When loading
store = BM25sStore.load(index_dir, mmap=True)

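With memory mapping, the operating system pages index data in on demand instead of loading the whole index into RAM, which lowers resident memory for large indexes.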
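
The general mechanism can be illustrated with Python's standard `mmap` module, independent of BM25sStore. The sketch below maps a file read-only and reads a small slice of it; the temporary file is a stand-in for a large index file, and `read_slice` is a hypothetical helper.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Read `length` bytes at `offset` through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Only the pages backing this slice need to be brought into memory.
            return mapped[offset:offset + length]


# Demo with a temporary file standing in for a large index file
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789" * 1000)
    demo_path = tmp.name

chunk = read_slice(demo_path, offset=5, length=10)
os.remove(demo_path)
print(chunk)  # b'5678901234'
```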

Field Caching:

Cache frequently accessed fields:

cache = FieldCache(ix, max_size=10000)
cache.cache_field("title")
cache.cache_field("category")

Migration¶

From FileStorage¶

Migrate an existing FileStorage index to BM25:

from semlix.tools import migrate_to_bm25

migrate_to_bm25(
    source_dir="old_whoosh_index",
    target_dir="new_bm25_index",
    batch_size=1000
)

The migration process:

Opens the source index
Creates a new BM25 index with the same schema
Copies all documents with progress tracking
Optimizes the new index

Custom Migration:

For more control, use IndexMigrator:

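from semlix.tools import IndexMigrator
from semlix.index import open_dir
from semlix.bm25 import create_bm25_index

migrator = IndexMigrator(verbose=True)

def should_migrate(fields):
    # User-defined filter (illustrative): skip documents without content
    return bool(fields.get("content"))

source = open_dir("old_index")
target = create_bm25_index("new_index", source.schema)

with source.searcher() as searcher:
    with target.writer() as writer:
        reader = searcher.reader()
        for docnum in range(reader.doc_count_all()):
            # doc_count_all() includes deleted documents; skip them
            # (assumes the reader provides a Whoosh-style is_deleted())
            if reader.is_deleted(docnum):
                continue
            fields = searcher.stored_fields(docnum)

            # Optional: filter documents during migration
            if should_migrate(fields):
                writer.add_document(**fields)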

Compatibility¶

Index Protocol¶

BM25Index implements the complete semlix Index protocol:

✅ writer() / reader() / searcher()
✅ optimize() / doc_count() / is_empty()
✅ add_field() / remove_field()
✅ latest_generation() / refresh()
✅ Schema management
✅ Context managers

This means BM25Index can replace FileIndex in code that relies only on these protocol methods; the exceptions are listed under Limitations.

HybridSearcher¶

BM25Index works directly with HybridSearcher, which combines lexical and semantic search:

from semlix.bm25 import open_bm25_index
from semlix.semantic import HybridSearcher, SentenceTransformerProvider
from semlix.semantic.stores import PgVectorStore

ix = open_bm25_index("my_index")
embedder = SentenceTransformerProvider()
vectors = PgVectorStore("postgresql://localhost/mydb", dimension=384)

searcher = HybridSearcher(ix, vectors, embedder, alpha=0.5)
results = searcher.search("query text", limit=10)
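
One common way to blend lexical and semantic scores with a weight like alpha is min-max normalization followed by a weighted sum. The sketch below illustrates that general technique only; how HybridSearcher actually fuses scores, and which side alpha weights, is not stated in this document, so treating alpha as the semantic weight is an assumption. `min_max` and `fuse` are hypothetical helpers.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse(lexical, semantic, alpha=0.5):
    """Blend scores: alpha weights the semantic side, (1 - alpha) the lexical side."""
    lex, sem = min_max(lexical), min_max(semantic)
    combined = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * lex.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


ranked = fuse(
    lexical={"a": 12.0, "b": 8.0, "c": 4.0},   # e.g. BM25 scores
    semantic={"b": 0.9, "c": 0.5, "d": 0.1},   # e.g. cosine similarities
    alpha=0.5,
)
print([doc_id for doc_id, _ in ranked])  # ['b', 'a', 'c', 'd']
```

Document "b" ranks first because it scores well on both sides, while "a" appears only in the lexical results.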

Limitations¶

Partial Implementation¶

Segment Management:

Unlike FileStorage, BM25Index does not expose a direct segment-management API. Segments are handled internally by bm25s, which is sufficient for most use cases.

Real-time Updates:

BM25sStore rebuilds the entire index on updates. For applications requiring frequent small updates, consider batching updates or using UnifiedIndex.
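
Batching updates can be as simple as buffering pending changes and applying them together, so that the rebuild cost is paid once per batch rather than once per document. The sketch below is library-independent; `UpdateBuffer` and its `apply_batch` callback are hypothetical names, not part of semlix.

```python
class UpdateBuffer:
    """Collect pending document changes and apply them in one batch."""

    def __init__(self, apply_batch, max_pending=500):
        self._apply_batch = apply_batch  # callable that receives a list of docs
        self._max_pending = max_pending
        self._pending = {}               # doc id -> latest fields

    def put(self, doc_id, fields):
        # A later update to the same document replaces the earlier one.
        self._pending[doc_id] = fields
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self):
        """Apply all pending changes at once; does nothing when empty."""
        if self._pending:
            batch = list(self._pending.values())
            self._pending = {}
            self._apply_batch(batch)
```

In an application, `apply_batch` would open a single writer and call update_document for each buffered document, and `flush()` would also be called on a timer or at shutdown.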

Not Implemented¶

The following FileStorage features are not implemented:

Direct segment access/manipulation
Custom codecs
Per-segment optimization
Incremental updates without rebuild

These features are rarely needed, and for most use cases the performance benefits of BM25 outweigh these limitations.

Examples¶

Basic Usage¶

from semlix.bm25 import create_bm25_index, open_bm25_index
from semlix.fields import Schema, TEXT, ID
from semlix.qparser import QueryParser

# Create
schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
ix = create_bm25_index("my_index", schema)

# Index
with ix.writer() as writer:
    writer.add_document(id="1", content="Python programming")
    writer.add_document(id="2", content="Database design")

ix.close()

# Open and search
ix = open_bm25_index("my_index")

with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    results = searcher.search(qp.parse("python"), limit=10)

    for hit in results:
        print(f"{hit['id']}: {hit.score:.3f}")

With Advanced Features¶

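from semlix.bm25 import (
    create_bm25_index,
    PhraseQuery,
    Facets,
    SortBy
)
from semlix.qparser import QueryParser

# Assumes `schema` defines "content", "category", and "date" fields
ix = create_bm25_index("my_index", schema)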

# ... index documents ...

with ix.searcher() as searcher:
    # Phrase search
    pq = PhraseQuery("content", ["machine", "learning"])
    results = pq.search(ix, limit=10)

    # Faceting
    facets = Facets(ix)
    qp = QueryParser("content", ix.schema)
    results = searcher.search(qp.parse("python"), limit=100)
    counts = facets.count_by_field(results, "category")

    # Sorting
    sorter = SortBy([("date", True), ("score", True)])
    sorted_results = sorter.sort_results(results)

Performance Comparison¶

Benchmarks (10K documents, 384-dim vectors):

Metric              FileStorage   BM25Index   Improvement
Search Speed        10-100 q/s    1000+ q/s   10-100x
Index Build Time    ~30s          ~5s         6x faster
Memory Usage        300MB         100MB       3x less
Concurrent Queries  Limited       Excellent   Much better