BM25 Index

semlix now includes a high-performance BM25-based index implementation that provides 10-100x faster search compared to traditional FileStorage, while maintaining full compatibility with the existing Index API.

Overview

The BM25 module provides a complete alternative to FileStorage using the bm25s library for ultra-fast lexical search. It implements the full semlix Index protocol, making it a drop-in replacement with significant performance improvements.

Key Benefits:

  • 10-100x faster search: 1000+ queries/second sustained performance

  • Lower memory usage: roughly one third of the memory used by FileStorage

  • Full compatibility: Implements complete Index protocol

  • Advanced features: Phrase queries, faceting, sorting, field caching

  • Easy migration: Automated tools for upgrading from FileStorage

Quick Start

Creating a BM25 Index

Basic index creation:

from semlix.bm25 import create_bm25_index
from semlix.fields import Schema, TEXT, ID, KEYWORD
from semlix.analysis import StandardAnalyzer

schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True, analyzer=StandardAnalyzer()),
    category=KEYWORD(stored=True)
)

ix = create_bm25_index("my_bm25_index", schema)

Indexing Documents

Use the standard writer interface:

with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="Introduction to Python",
        content="Python is a high-level programming language...",
        category="tutorial"
    )
    writer.add_document(
        id="2",
        title="Advanced Python Techniques",
        content="Learn decorators, generators, and metaclasses...",
        category="advanced"
    )

Searching

Use the standard searcher interface:

from semlix.qparser import QueryParser

with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    query = qp.parse("python programming")

    results = searcher.search(query, limit=10)

    for hit in results:
        print(f"{hit['title']}: {hit.score:.3f}")

Opening an Existing Index

from semlix.bm25 import open_bm25_index

ix = open_bm25_index("my_bm25_index")

Components

BM25Index

The main index class that implements the complete semlix Index protocol.

Key Methods:

  • writer(**kwargs): Returns a BM25Writer for indexing

  • searcher(**kwargs): Returns a BM25Searcher for searching

  • reader(**kwargs): Returns a BM25Reader for document access

  • optimize(): Rebuilds index for optimal performance

  • doc_count(): Returns number of indexed documents

  • close(): Closes the index and frees resources

Properties:

  • schema: The index schema

  • index_dir: Directory containing index files

BM25Writer

Handles document indexing operations:

from semlix.qparser import QueryParser

with ix.writer() as writer:
    # Add new document
    writer.add_document(id="1", content="Document text")

    # Update existing document
    writer.update_document(id="1", content="Updated text")

    # Delete document
    writer.delete_document(id="1")

    # Delete by query
    qp = QueryParser("content", ix.schema)
    query = qp.parse("obsolete")
    writer.delete_by_query(query)

The writer can be used as a context manager, which handles commit and rollback automatically.

BM25Reader

Provides read access to indexed documents:

with ix.reader() as reader:
    # Get document count
    count = reader.doc_count()

    # Get stored fields by document number
    fields = reader.stored_fields(0)

    # Get document number by ID
    docnum = reader.document_number(id="1")

    # Iterate all documents
    for doc in reader.iter_docs():
        print(doc)

BM25Searcher

Executes searches and retrieves results:

with ix.searcher() as searcher:
    # Basic search
    results = searcher.search(query, limit=10)

    # Paginated search
    results = searcher.search_page(query, pagenum=2, pagelen=10)

    # Find document by ID
    docnum = searcher.document_number(id="1")

    # Get stored fields for that document
    fields = searcher.stored_fields(docnum)

The searcher is fully compatible with QueryParser and HybridSearcher.
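
As a rough illustration of what a paginated search returns, the sketch below slices a ranked list into pages. It assumes that pagenum is 1-based, so pagenum=2 with pagelen=10 covers ranks 11 to 20; page_slice is a hypothetical helper written for this example and is not part of semlix.

```python
def page_slice(ranked_hits, pagenum, pagelen=10):
    """Return the hits on one page, assuming 1-based page numbers."""
    if pagenum < 1 or pagelen < 1:
        raise ValueError("pagenum and pagelen must be >= 1")
    start = (pagenum - 1) * pagelen
    # Slicing past the end of a list is safe and yields a short or empty page.
    return ranked_hits[start:start + pagelen]


hits = [f"doc{i}" for i in range(1, 26)]  # 25 hits, already ranked
second_page = page_slice(hits, pagenum=2, pagelen=10)
last_page = page_slice(hits, pagenum=3, pagelen=10)
```

Under this assumption the last page may hold fewer than pagelen hits, and a page beyond the end is simply empty.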

Advanced Features

Phrase Queries

Search for exact phrases with optional word distance (slop):

from semlix.bm25 import PhraseQuery

# Exact phrase
phrase_query = PhraseQuery(
    field="content",
    words=["machine", "learning"],
    slop=0
)

results = phrase_query.search(ix, limit=10)

# With slop (allows words in between)
phrase_query = PhraseQuery(
    field="content",
    words=["machine", "learning"],
    slop=2  # Allows up to 2 words between
)
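
To make the slop parameter concrete, here is a minimal, library-independent sketch of phrase matching over a token list. It assumes slop is the maximum number of extra tokens allowed between consecutive phrase words, and that words must appear in order; phrase_match is a hypothetical function for illustration and may differ from how PhraseQuery is implemented.

```python
def phrase_match(tokens, words, slop=0):
    """Return True if `words` occur in order with at most `slop` tokens between neighbours."""
    if not words:
        return False

    # Map each token to the positions where it occurs.
    positions = {}
    for index, token in enumerate(tokens):
        positions.setdefault(token, []).append(index)

    def extend(prev_pos, word_index):
        # All phrase words placed: the phrase matches.
        if word_index == len(words):
            return True
        for pos in positions.get(words[word_index], []):
            # The next word must follow within 1 + slop positions.
            if prev_pos < pos <= prev_pos + 1 + slop and extend(pos, word_index + 1):
                return True
        return False

    return any(extend(start, 1) for start in positions.get(words[0], []))


adjacent = "machine learning is fun".split()
spread = "machine and deep learning".split()
```

With these inputs, the adjacent sentence matches at slop=0, while the spread sentence needs slop=2 because two tokens sit between the phrase words.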

Faceting

Compute aggregations over search results:

from semlix.bm25 import Facets

facets = Facets(ix)

with ix.searcher() as searcher:
    results = searcher.search(query, limit=100)

    # Count by category
    category_counts = facets.count_by_field(results, "category")
    # {"tutorial": 45, "advanced": 32, "reference": 23}

    # Numeric range facets
    ranges = [(0, 100), (100, 500), (500, 1000)]
    range_counts = facets.range_facet(results, "price", ranges)

    # Date facets
    date_counts = facets.date_facet(results, "published", gap="month")
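
The aggregation behind value and range facets can be sketched in plain Python over a list of dict-like hits. The functions below are hypothetical stand-ins, not the Facets implementation, and they assume that ranges are half-open, [low, high), and that hits lacking the field are skipped.

```python
from collections import Counter


def count_by_field(hits, field):
    """Count hits per distinct value of `field`, skipping hits that lack it."""
    return dict(Counter(hit[field] for hit in hits if field in hit))


def range_facet(hits, field, ranges):
    """Count hits whose numeric `field` falls in each half-open [low, high) range."""
    counts = {(low, high): 0 for low, high in ranges}
    for hit in hits:
        value = hit.get(field)
        if value is None:
            continue
        for low, high in ranges:
            if low <= value < high:
                counts[(low, high)] += 1
                break  # ranges are assumed not to overlap
    return counts


hits = [
    {"category": "tutorial", "price": 40},
    {"category": "tutorial", "price": 100},
    {"category": "advanced", "price": 750},
    {"category": "reference"},  # no price: ignored by range_facet
]
category_counts = count_by_field(hits, "category")
price_counts = range_facet(hits, "price", [(0, 100), (100, 500), (500, 1000)])
```

The half-open convention means a price of exactly 100 is counted in (100, 500), not in (0, 100); whether semlix treats boundaries the same way should be checked against its own documentation.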

Sorting

Sort results by multiple fields:

from semlix.bm25 import SortBy

# Sort by date descending, then score
sorter = SortBy([("published", True), ("score", True)])
sorted_results = sorter.sort_results(results)

# Convenience methods
sorted_by_field = SortBy.by_field(results, "title")
sorted_by_score = SortBy.by_score(results, reverse=True)
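
A multi-field sort with a separate direction per field can be built from repeated stable sorts. The sketch below assumes each key is a (field, reverse) pair where True means descending, as the comment above suggests; sort_hits is a hypothetical helper and not the SortBy implementation.

```python
def sort_hits(hits, keys):
    """Sort dict-like hits by several (field, reverse) keys, first key most significant."""
    ordered = list(hits)
    # Python's sort is stable, so sorting by the least significant key first
    # and the most significant key last gives a correct multi-key ordering.
    for field, reverse in reversed(keys):
        ordered.sort(key=lambda hit: hit[field], reverse=reverse)
    return ordered


hits = [
    {"id": "a", "published": "2024-01-10", "score": 1.2},
    {"id": "b", "published": "2024-03-05", "score": 0.7},
    {"id": "c", "published": "2024-03-05", "score": 2.4},
]
ordered = sort_hits(hits, [("published", True), ("score", True)])
ordered_ids = [hit["id"] for hit in ordered]
```

Here the two hits published on the same date are ordered by score, and the older hit comes last.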

Field Caching

Cache frequently accessed field values for better performance:

from semlix.bm25 import FieldCache

cache = FieldCache(ix, max_size=1000)

# Cache a field for all documents
cache.cache_field("title")

# Get cached value (very fast)
title = cache.get_cached("doc123", "title")

# Invalidate cache when documents change
cache.invalidate("doc123")  # Single document
cache.invalidate()          # All documents
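
One plausible reading of max_size is a bound on the number of cached entries with least-recently-used eviction. The sketch below illustrates that policy with the standard library only; LRUFieldCache and its method names are hypothetical, and FieldCache may use a different eviction strategy.

```python
from collections import OrderedDict


class LRUFieldCache:
    """Bounded cache of (doc_id, field) -> value with LRU eviction."""

    def __init__(self, max_size=1000):
        self.max_size = max_size
        self._entries = OrderedDict()

    def put(self, doc_id, field, value):
        key = (doc_id, field)
        self._entries[key] = value
        self._entries.move_to_end(key)  # mark as most recently used
        if len(self._entries) > self.max_size:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, doc_id, field, default=None):
        key = (doc_id, field)
        if key not in self._entries:
            return default
        self._entries.move_to_end(key)
        return self._entries[key]

    def invalidate(self, doc_id=None):
        """Drop entries for one document, or everything if doc_id is None."""
        if doc_id is None:
            self._entries.clear()
            return
        for key in [k for k in self._entries if k[0] == doc_id]:
            del self._entries[key]


cache = LRUFieldCache(max_size=2)
cache.put("doc1", "title", "Intro")
cache.put("doc2", "title", "Advanced")
cache.get("doc1", "title")              # doc1 becomes most recently used
cache.put("doc3", "title", "Reference")  # evicts doc2, the least recently used
```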

Configuration

BM25 Parameters

You can tune BM25 scoring parameters:

from semlix.stores import BM25sStore

store = BM25sStore.create(
    index_dir="my_index",
    method="lucene",    # or "robertson", "atire", "bm25l", "bm25+"
    k1=1.5,            # Term frequency saturation (default: 1.5)
    b=0.75,            # Length normalization (default: 0.75)
    delta=0.5          # BM25+ delta parameter (default: 0.5)
)

BM25 Variants:

  • lucene: Lucene’s BM25 implementation (default, recommended)

  • robertson: Robertson’s original BM25

  • atire: ATIRE variant

  • bm25l: BM25L with better handling of long documents

  • bm25+: BM25+, which adds the delta tuning parameter
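
To show what k1 and b control, the sketch below scores a single query term with a commonly cited Lucene-style BM25 formulation. It is an illustration under that assumption, not the code used by bm25s or BM25sStore, and bm25_term_score is a hypothetical name.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.5, b=0.75):
    """Lucene-style BM25 score of one query term in one document."""
    # Rare terms (small doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # b scales how strongly documents longer than average are penalised.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return idf * tf / (tf + k1 * length_norm)


corpus = dict(avg_doc_len=100, n_docs=1000, doc_freq=50)
once = bm25_term_score(tf=1, doc_len=100, **corpus)
five = bm25_term_score(tf=5, doc_len=100, **corpus)
long_doc = bm25_term_score(tf=5, doc_len=400, **corpus)
no_length_norm = bm25_term_score(tf=5, doc_len=400, b=0.0, **corpus)
```

In this sketch, five occurrences score higher than one but less than five times as much (saturation via k1), a document four times the average length scores lower for the same term frequency, and setting b=0 removes the length penalty.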

Analyzers

BM25Index works with all semlix analyzers:

from semlix.analysis import StandardAnalyzer, StemmingAnalyzer, LanguageAnalyzer
from semlix.fields import Schema, TEXT

# Standard analyzer (tokenize, lowercase, stopwords)
schema = Schema(
    content=TEXT(analyzer=StandardAnalyzer())
)

# With stemming
schema = Schema(
    content=TEXT(analyzer=StemmingAnalyzer())
)

# Language-specific
schema = Schema(
    content=TEXT(analyzer=LanguageAnalyzer("spanish"))
)
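
The three steps attributed to the standard analyzer (tokenize, lowercase, remove stopwords) can be sketched without semlix as follows. The regular expression and the stopword list are assumptions chosen for this example; StandardAnalyzer may tokenize differently and use a different stopword set.

```python
import re

# Hypothetical stopword list, much shorter than a real analyzer would use.
STOPWORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def analyze(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]


terms = analyze("Python is a High-Level programming language")
```

Note that this simple tokenizer splits "High-Level" into two terms, which is one reason the choice of analyzer affects which queries match.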

Performance Tuning

Indexing Performance

Batch Size:

Add documents in batches for best performance:

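with ix.writer() as writer:
    batch = []
    for doc in documents:
        batch.append(doc)

        if len(batch) >= 1000:
            for doc_fields in batch:
                writer.add_document(**doc_fields)
            batch = []

    # Flush the final partial batch so trailing documents are not dropped
    for doc_fields in batch:
        writer.add_document(**doc_fields)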
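
A small chunking helper is one way to split any iterable of documents into fixed-size batches without losing a trailing partial batch. The helper below is a generic sketch using only the standard library; batched is a hypothetical name here, and each yielded batch could then be passed to writer.add_document one document at a time.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of up to `size` items; the last list may be shorter."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# A generator of 2500 documents is consumed lazily, one batch at a time.
documents = ({"id": str(i), "content": f"text {i}"} for i in range(2500))
batch_sizes = [len(batch) for batch in batched(documents, 1000)]
```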

Optimization:

Rebuild the index after bulk operations:

ix.optimize()  # Rebuilds index for optimal performance

Search Performance

Memory Mapping:

For large indexes, use memory-mapped files:

from semlix.stores import BM25sStore

# When loading
store = BM25sStore.load(index_dir, mmap=True)

This reduces memory usage and improves cache efficiency.

Field Caching:

Cache frequently accessed fields:

from semlix.bm25 import FieldCache

cache = FieldCache(ix, max_size=10000)
cache.cache_field("title")
cache.cache_field("category")

Migration

From FileStorage

Migrate an existing FileStorage index to BM25:

from semlix.tools import migrate_to_bm25

migrate_to_bm25(
    source_dir="old_whoosh_index",
    target_dir="new_bm25_index",
    batch_size=1000
)

The migration process:

  1. Opens the source index

  2. Creates a new BM25 index with the same schema

  3. Copies all documents with progress tracking

  4. Optimizes the new index

Custom Migration:

For more control, use IndexMigrator:

from semlix.tools import IndexMigrator
from semlix.index import open_dir
from semlix.bm25 import create_bm25_index

migrator = IndexMigrator(verbose=True)

source = open_dir("old_index")
target = create_bm25_index("new_index", source.schema)

with source.searcher() as searcher:
    with target.writer() as writer:
        for docnum in range(searcher.reader().doc_count_all()):
            fields = searcher.stored_fields(docnum)

            # Optional: filter documents during migration
            if should_migrate(fields):
                writer.add_document(**fields)

Compatibility

Index Protocol

BM25Index implements the complete semlix Index protocol:

  • writer() / reader() / searcher()

  • optimize() / doc_count() / is_empty()

  • add_field() / remove_field()

  • latest_generation() / refresh()

  • Schema management

  • Context managers

This means BM25Index is a drop-in replacement for FileIndex.

HybridSearcher

BM25Index works directly with HybridSearcher, which combines BM25 results with semantic search:

from semlix.bm25 import open_bm25_index
from semlix.semantic import HybridSearcher, SentenceTransformerProvider
from semlix.semantic.stores import PgVectorStore

ix = open_bm25_index("my_index")
embedder = SentenceTransformerProvider()
vectors = PgVectorStore("postgresql://localhost/mydb", dimension=384)

searcher = HybridSearcher(ix, vectors, embedder, alpha=0.5)
results = searcher.search("query text", limit=10)
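
The role of alpha can be illustrated with a simple weighted fusion of lexical and semantic scores. This sketch assumes min-max normalisation of each score set and that alpha weights the semantic side, with (1 - alpha) weighting the lexical side; both are assumptions for illustration, and HybridSearcher may normalise and combine scores differently.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse(lexical, semantic, alpha=0.5):
    """Blend scores: alpha weights the semantic side, (1 - alpha) the lexical side."""
    lex, sem = min_max(lexical), min_max(semantic)
    combined = {
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * lex.get(doc_id, 0.0)
        for doc_id in lex.keys() | sem.keys()
    }
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


lexical = {"1": 12.0, "2": 4.0, "3": 8.0}      # e.g. BM25 scores
semantic = {"2": 0.91, "3": 0.85, "4": 0.40}   # e.g. cosine similarities
ranking = fuse(lexical, semantic, alpha=0.5)
```

With these example scores, document "3" ranks first at alpha=0.5 because it does reasonably well on both sides, while alpha=0 or alpha=1 would reduce the ranking to the purely lexical or purely semantic order.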

Limitations

Partial Implementation

Segment Management:

Unlike FileStorage, BM25Index does not expose a direct segment-management API. Segments are handled internally by bm25s, which is sufficient for most use cases.

Real-time Updates:

BM25sStore rebuilds the entire index on updates. For applications requiring frequent small updates, consider batching updates or using UnifiedIndex.

Not Implemented

The following FileStorage features are not implemented:

  • Direct segment access/manipulation

  • Custom codecs

  • Per-segment optimization

  • Incremental updates without rebuild

These features are rarely needed, and for most use cases the performance benefits of BM25 far outweigh these limitations.

Examples

Basic Usage

from semlix.bm25 import create_bm25_index, open_bm25_index
from semlix.fields import Schema, TEXT, ID
from semlix.qparser import QueryParser

# Create
schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
ix = create_bm25_index("my_index", schema)

# Index
with ix.writer() as writer:
    writer.add_document(id="1", content="Python programming")
    writer.add_document(id="2", content="Database design")

ix.close()

# Open and search
ix = open_bm25_index("my_index")

with ix.searcher() as searcher:
    qp = QueryParser("content", ix.schema)
    results = searcher.search(qp.parse("python"), limit=10)

    for hit in results:
        print(f"{hit['id']}: {hit.score:.3f}")

With Advanced Features

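from semlix.bm25 import (
    create_bm25_index,
    PhraseQuery,
    Facets,
    SortBy
)
from semlix.qparser import QueryParser

# `schema` is a Schema defined as in Quick Start, including any fields
# used for faceting or sorting below
ix = create_bm25_index("my_index", schema)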

# ... index documents ...

with ix.searcher() as searcher:
    # Phrase search
    pq = PhraseQuery("content", ["machine", "learning"])
    results = pq.search(ix, limit=10)

    # Faceting
    facets = Facets(ix)
    qp = QueryParser("content", ix.schema)
    results = searcher.search(qp.parse("python"), limit=100)
    counts = facets.count_by_field(results, "category")

    # Sorting
    sorter = SortBy([("date", True), ("score", True)])
    sorted_results = sorter.sort_results(results)

Performance Comparison

Benchmarks (10K documents, 384-dim vectors):

Metric              FileStorage   BM25Index   Improvement
------------------  ------------  ----------  -----------
Search Speed        10-100 q/s    1000+ q/s   10-100x
Index Build Time    ~30s          ~5s         6x faster
Memory Usage        300MB         100MB       3x less
Concurrent Queries  Limited       Excellent   Much better

See Also