================
BM25 Index
================

semlix includes a BM25-based index implementation that searches 10-100x faster
than the traditional FileStorage backend while remaining fully compatible with
the existing Index API.

Overview
========

The BM25 module provides a complete alternative to FileStorage using the ``bm25s`` library
for fast lexical search. It implements the full semlix Index protocol, making it
a drop-in replacement with significant performance improvements.

**Key Benefits:**

* **10-100x faster search**: 1000+ queries/second sustained performance
* **Lower memory usage**: 3x less memory than FileStorage
* **Full compatibility**: Implements complete Index protocol
* **Advanced features**: Phrase queries, faceting, sorting, field caching
* **Easy migration**: Automated tools for upgrading from FileStorage

Quick Start
===========

Creating a BM25 Index
----------------------

Basic index creation::

    from semlix.bm25 import create_bm25_index
    from semlix.fields import Schema, TEXT, ID, KEYWORD
    from semlix.analysis import StandardAnalyzer

    schema = Schema(
        id=ID(stored=True),
        title=TEXT(stored=True, analyzer=StandardAnalyzer()),
        content=TEXT(stored=True, analyzer=StandardAnalyzer()),
        category=KEYWORD(stored=True)
    )

    ix = create_bm25_index("my_bm25_index", schema)

Indexing Documents
------------------

Use the standard writer interface::

    with ix.writer() as writer:
        writer.add_document(
            id="1",
            title="Introduction to Python",
            content="Python is a high-level programming language...",
            category="tutorial"
        )
        writer.add_document(
            id="2",
            title="Advanced Python Techniques",
            content="Learn decorators, generators, and metaclasses...",
            category="advanced"
        )

Searching
---------

Use the standard searcher interface::

    from semlix.qparser import QueryParser

    with ix.searcher() as searcher:
        qp = QueryParser("content", ix.schema)
        query = qp.parse("python programming")

        results = searcher.search(query, limit=10)

        for hit in results:
            print(f"{hit['title']}: {hit.score:.3f}")

Opening an Existing Index
--------------------------

::

    from semlix.bm25 import open_bm25_index

    ix = open_bm25_index("my_bm25_index")

Components
==========

BM25Index
---------

The main index class that implements the complete semlix Index protocol.

**Key Methods:**

* ``writer(**kwargs)``: Returns a BM25Writer for indexing
* ``searcher(**kwargs)``: Returns a BM25Searcher for searching
* ``reader(**kwargs)``: Returns a BM25Reader for document access
* ``optimize()``: Rebuilds index for optimal performance
* ``doc_count()``: Returns number of indexed documents
* ``close()``: Closes the index and frees resources

**Properties:**

* ``schema``: The index schema
* ``index_dir``: Directory containing index files

BM25Writer
----------

Handles document indexing operations::

    from semlix.qparser import QueryParser

    with ix.writer() as writer:
        # Add new document
        writer.add_document(id="1", content="Document text")

        # Update existing document
        writer.update_document(id="1", content="Updated text")

        # Delete document
        writer.delete_document(id="1")

        # Delete by query
        qp = QueryParser("content", ix.schema)
        query = qp.parse("obsolete")
        writer.delete_by_query(query)

The writer is a context manager: leaving the ``with`` block commits the changes
automatically, and an exception inside the block triggers a rollback instead.
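
The commit/rollback behaviour can be pictured with a toy class. This is a minimal
sketch of the context-manager pattern, not the semlix implementation:
``SketchWriter`` and its ``pending`` and ``committed`` lists are hypothetical names
introduced only for illustration::

    class SketchWriter:
        """Toy writer that buffers documents and commits or cancels on exit."""

        def __init__(self):
            self.pending = []    # documents added but not yet committed
            self.committed = []  # documents visible after a commit

        def add_document(self, **fields):
            self.pending.append(fields)

        def commit(self):
            self.committed.extend(self.pending)
            self.pending.clear()

        def cancel(self):
            self.pending.clear()

        def __enter__(self):
            return self

        def __exit__(self, exc_type, exc, tb):
            # Commit on a clean exit, discard pending work on an exception
            if exc_type is None:
                self.commit()
            else:
                self.cancel()
            return False  # never swallow the exception

    writer = SketchWriter()

    with writer:
        writer.add_document(id="1", content="kept")

    try:
        with writer:
            writer.add_document(id="2", content="discarded")
            raise ValueError("indexing failed")
    except ValueError:
        pass

After both blocks run, only the first document has been committed; the second was
dropped when the exception propagated out of the ``with`` block.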

BM25Reader
----------

Provides read access to indexed documents::

    with ix.reader() as reader:
        # Get document count
        count = reader.doc_count()

        # Get stored fields by document number
        fields = reader.stored_fields(0)

        # Get document number by ID
        docnum = reader.document_number(id="1")

        # Iterate all documents
        for doc in reader.iter_docs():
            print(doc)

BM25Searcher
------------

Executes searches and retrieves results::

    with ix.searcher() as searcher:
        # Basic search
        results = searcher.search(query, limit=10)

        # Paginated search
        results = searcher.search_page(query, pagenum=2, pagelen=10)

        # Get stored fields
        fields = searcher.stored_fields(docnum)

        # Find document by ID
        docnum = searcher.document_number(id="1")

The searcher is fully compatible with QueryParser and HybridSearcher.
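
For paginated search, ``pagenum`` and ``pagelen`` map onto a slice of the ranked
results. The sketch below assumes page numbers start at 1, which this document
does not state explicitly; ``page_bounds`` is a hypothetical helper and not part
of semlix::

    def page_bounds(pagenum, pagelen, total):
        """Return the (start, end) slice of results shown on a 1-based page."""
        if pagenum < 1 or pagelen < 1:
            raise ValueError("pagenum and pagelen must be positive")
        start = min((pagenum - 1) * pagelen, total)
        end = min(start + pagelen, total)
        return start, end

    # Page 2 of 25 results at 10 per page covers results 10..19
    second_page = page_bounds(pagenum=2, pagelen=10, total=25)
    # The last page is shorter than pagelen
    last_page = page_bounds(pagenum=3, pagelen=10, total=25)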

Advanced Features
=================

Phrase Queries
--------------

Search for exact phrases with optional word distance (slop)::

    from semlix.bm25 import PhraseQuery

    # Exact phrase
    phrase_query = PhraseQuery(
        field="content",
        words=["machine", "learning"],
        slop=0
    )

    results = phrase_query.search(ix, limit=10)

    # With slop (allows words in between)
    phrase_query = PhraseQuery(
        field="content",
        words=["machine", "learning"],
        slop=2  # Allows up to 2 words between
    )
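
To make the effect of ``slop`` concrete, here is a minimal pure-Python sketch of
in-order phrase matching over a token list. It assumes ``slop`` is the total
number of extra tokens allowed between the first and last phrase word; the exact
semantics used by ``PhraseQuery`` may differ, and ``phrase_matches`` is a
hypothetical helper::

    def phrase_matches(tokens, words, slop=0):
        """True if `words` occur in order with at most `slop` extra tokens between them."""
        if not words:
            return False
        # Sorted positions of each phrase word in the token stream
        positions = [[i for i, t in enumerate(tokens) if t == w] for w in words]
        if any(not p for p in positions):
            return False

        def extend(word_idx, prev_pos, start_pos):
            if word_idx == len(words):
                return True
            for pos in positions[word_idx]:
                if pos <= prev_pos:
                    continue  # words must appear in phrase order
                if (pos - start_pos) - word_idx > slop:
                    break  # later positions only widen the gap
                if extend(word_idx + 1, pos, start_pos):
                    return True
            return False

        return any(extend(1, start, start) for start in positions[0])

    tokens = "machine and deep learning".split()
    exact = phrase_matches(tokens, ["machine", "learning"], slop=0)
    loose = phrase_matches(tokens, ["machine", "learning"], slop=2)

Here ``exact`` is ``False`` because two tokens separate the words, while ``loose``
is ``True`` because a slop of 2 tolerates exactly that gap.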

Faceting
--------

Compute aggregations over search results::

    from semlix.bm25 import Facets

    facets = Facets(ix)

    with ix.searcher() as searcher:
        results = searcher.search(query, limit=100)

        # Count by category
        category_counts = facets.count_by_field(results, "category")
        # {"tutorial": 45, "advanced": 32, "reference": 23}

        # Numeric range facets
        ranges = [(0, 100), (100, 500), (500, 1000)]
        range_counts = facets.range_facet(results, "price", ranges)

        # Date facets
        date_counts = facets.date_facet(results, "published", gap="month")
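
Conceptually, a facet is a count of hits grouped by a field value or bucket. The
following sketch shows the idea over plain dictionaries. It is not the ``Facets``
implementation: the helper names are hypothetical, and treating ranges as
half-open ``[low, high)`` is an assumption this document does not confirm::

    from collections import Counter

    def count_by_value(hits, field):
        """Count hits per distinct value of `field`, skipping hits that lack it."""
        return Counter(hit[field] for hit in hits if field in hit)

    def count_by_range(hits, field, ranges):
        """Count hits whose numeric `field` falls in each half-open range."""
        counts = {bounds: 0 for bounds in ranges}
        for hit in hits:
            value = hit.get(field)
            if value is None:
                continue
            for low, high in ranges:
                if low <= value < high:
                    counts[(low, high)] += 1
                    break
        return counts

    hits = [
        {"category": "tutorial", "price": 50},
        {"category": "tutorial", "price": 100},
        {"category": "advanced", "price": 750},
        {"category": "reference"},
    ]
    by_category = count_by_value(hits, "category")
    by_price = count_by_range(hits, "price", [(0, 100), (100, 500), (500, 1000)])

With half-open ranges, a price of exactly 100 is counted once, in the
``(100, 500)`` bucket, rather than in both neighbouring buckets.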

Sorting
-------

Sort results by multiple fields::

    from semlix.bm25 import SortBy

    # Sort by date descending, then score
    sorter = SortBy([("published", True), ("score", True)])
    sorted_results = sorter.sort_results(results)

    # Convenience methods
    sorted_by_field = SortBy.by_field(results, "title")
    sorted_by_score = SortBy.by_score(results, reverse=True)
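
Multi-field sorting with a per-field direction can be sketched using Python's
stable sort: sort by the least significant key first and the most significant
key last. ``sort_hits`` below is a hypothetical stand-in operating on plain
dictionaries, and reading the boolean in each pair as "descending" is an
assumption based on the comment in the example above::

    def sort_hits(hits, keys):
        """Sort by several (field, descending) pairs; earlier pairs take precedence."""
        ordered = list(hits)
        # Stable sorting preserves the order set by less significant keys
        for field, descending in reversed(keys):
            ordered.sort(key=lambda hit: hit[field], reverse=descending)
        return ordered

    hits = [
        {"id": "a", "published": "2024-01", "score": 1.0},
        {"id": "b", "published": "2024-03", "score": 0.5},
        {"id": "c", "published": "2024-03", "score": 2.0},
    ]
    ranked = sort_hits(hits, [("published", True), ("score", True)])
    ranked_ids = [hit["id"] for hit in ranked]

Documents ``b`` and ``c`` share a publication date, so the score decides their
relative order.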

Field Caching
-------------

Cache frequently accessed field values for better performance::

    from semlix.bm25 import FieldCache

    cache = FieldCache(ix, max_size=1000)

    # Cache a field for all documents
    cache.cache_field("title")

    # Get cached value (very fast)
    title = cache.get_cached("doc123", "title")

    # Invalidate cache when documents change
    cache.invalidate("doc123")  # Single document
    cache.invalidate()          # All documents
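
A bounded cache such as this needs an eviction policy once ``max_size`` is
reached. The sketch below uses least-recently-used eviction, which is an
assumption: this document does not say which policy ``FieldCache`` applies.
``LRUFieldCache`` and the ``load`` callback are hypothetical names::

    from collections import OrderedDict

    class LRUFieldCache:
        """Bounded (doc_id, field) -> value cache with LRU eviction."""

        def __init__(self, max_size):
            self.max_size = max_size
            self._data = OrderedDict()

        def get(self, doc_id, field, load):
            key = (doc_id, field)
            if key in self._data:
                self._data.move_to_end(key)  # mark as recently used
                return self._data[key]
            value = load(doc_id, field)      # cache miss: fetch from the index
            self._data[key] = value
            if len(self._data) > self.max_size:
                self._data.popitem(last=False)  # evict the oldest entry
            return value

        def invalidate(self, doc_id=None):
            if doc_id is None:
                self._data.clear()
            else:
                for key in [k for k in self._data if k[0] == doc_id]:
                    del self._data[key]

    loads = []

    def load(doc_id, field):
        loads.append((doc_id, field))
        return f"{field} of {doc_id}"

    cache = LRUFieldCache(max_size=2)
    cache.get("d1", "title", load)
    cache.get("d1", "title", load)  # served from the cache
    cache.get("d2", "title", load)
    cache.get("d3", "title", load)  # evicts d1
    cache.get("d1", "title", load)  # loaded again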

Configuration
=============

BM25 Parameters
---------------

You can tune BM25 scoring parameters::

    from semlix.stores import BM25sStore

    store = BM25sStore.create(
        index_dir="my_index",
        method="lucene",    # or "robertson", "atire", "bm25l", "bm25+"
        k1=1.5,            # Term frequency saturation (default: 1.5)
        b=0.75,            # Length normalization (default: 0.75)
        delta=0.5          # BM25+ delta parameter (default: 0.5)
    )

**BM25 Variants:**

* ``lucene``: Lucene's BM25 implementation (default, recommended)
* ``robertson``: Robertson's original BM25
* ``atire``: ATIRE variant
* ``bm25l``: BM25L with better handling of long documents
* ``bm25+``: BM25+ with additional tuning parameter
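
To see what ``k1`` and ``b`` control, the sketch below scores a single term with
the Lucene-style BM25 formula as it is commonly stated. It is an illustration
only: the function name is hypothetical, and the exact formula used internally
for each ``method`` is not confirmed by this document::

    import math

    def bm25_lucene_term_score(tf, df, doc_len, avg_doc_len, num_docs,
                               k1=1.5, b=0.75):
        """Score one term in one document (Lucene-style BM25)."""
        # Rare terms (low document frequency) get a higher weight
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # b scales the penalty for documents longer than average
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        # k1 controls how quickly repeated occurrences saturate
        return idf * tf / (tf + norm)

    stats = dict(df=10, avg_doc_len=100, num_docs=1000)
    average_doc = bm25_lucene_term_score(tf=1, doc_len=100, **stats)
    long_doc = bm25_lucene_term_score(tf=1, doc_len=200, **stats)
    repeated_term = bm25_lucene_term_score(tf=2, doc_len=100, **stats)

With the defaults, ``long_doc`` scores lower than ``average_doc`` because of
length normalization, and ``repeated_term`` scores higher than ``average_doc``
but less than twice as high because term frequency saturates. Setting ``b=0``
removes the length penalty entirely.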

Analyzers
---------

BM25Index works with all semlix analyzers::

    from semlix.analysis import StandardAnalyzer, StemmingAnalyzer, LanguageAnalyzer

    # Standard analyzer (tokenize, lowercase, stopwords)
    schema = Schema(
        content=TEXT(analyzer=StandardAnalyzer())
    )

    # With stemming
    schema = Schema(
        content=TEXT(analyzer=StemmingAnalyzer())
    )

    # Language-specific
    schema = Schema(
        content=TEXT(analyzer=LanguageAnalyzer("spanish"))
    )

Performance Tuning
==================

Indexing Performance
--------------------

**Batch Size:**

Add documents in batches for best performance::

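    with ix.writer() as writer:
        batch = []
        for doc in documents:
            batch.append(doc)

            if len(batch) >= 1000:
                for doc_fields in batch:
                    writer.add_document(**doc_fields)
                batch = []

        # Flush the final partial batch left over after the loop
        for doc_fields in batch:
            writer.add_document(**doc_fields)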
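
Any fixed-size batching loop must also handle a final batch that is smaller than
the batch size. One way to keep that logic in a single place is a small generator;
``batched`` below is a hypothetical helper written with the standard library, not
a semlix function::

    from itertools import islice

    def batched(iterable, size):
        """Yield lists of up to `size` items, including a final partial batch."""
        iterator = iter(iterable)
        while True:
            batch = list(islice(iterator, size))
            if not batch:
                return
            yield batch

    batches = list(batched(range(5), 2))

Each yielded batch could then be passed to ``writer.add_document`` one document
at a time, with no separate code path for the leftover items.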

**Optimization:**

Rebuild the index after bulk operations::

    ix.optimize()  # Rebuilds index for optimal performance

Search Performance
------------------

**Memory Mapping:**

For large indexes, use memory-mapped files::

    from semlix.stores import BM25sStore

    # When loading
    store = BM25sStore.load(index_dir, mmap=True)

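With memory mapping, the operating system pages index data in on demand instead of
loading the whole index into RAM, which lowers memory usage and lets the OS page
cache serve repeated reads.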

**Field Caching:**

Cache frequently accessed fields::

    cache = FieldCache(ix, max_size=10000)
    cache.cache_field("title")
    cache.cache_field("category")

Migration
=========

From FileStorage
----------------

Migrate an existing FileStorage index to BM25::

    from semlix.tools import migrate_to_bm25

    migrate_to_bm25(
        source_dir="old_whoosh_index",
        target_dir="new_bm25_index",
        batch_size=1000
    )

The migration process:

1. Opens the source index
2. Creates a new BM25 index with the same schema
3. Copies all documents with progress tracking
4. Optimizes the new index

**Custom Migration:**

For more control, use IndexMigrator::

    from semlix.tools import IndexMigrator
    from semlix.index import open_dir
    from semlix.bm25 import create_bm25_index

    migrator = IndexMigrator(verbose=True)

    source = open_dir("old_index")
    target = create_bm25_index("new_index", source.schema)

    def should_migrate(fields):
        """Placeholder filter: replace with your own predicate."""
        return True

    with source.searcher() as searcher:
        with target.writer() as writer:
            for docnum in range(searcher.reader().doc_count_all()):
                fields = searcher.stored_fields(docnum)

                # Optional: filter documents during migration
                if should_migrate(fields):
                    writer.add_document(**fields)

Compatibility
=============

Index Protocol
--------------

BM25Index implements the complete semlix Index protocol:

* ✅ ``writer()`` / ``reader()`` / ``searcher()``
* ✅ ``optimize()`` / ``doc_count()`` / ``is_empty()``
* ✅ ``add_field()`` / ``remove_field()``
* ✅ ``latest_generation()`` / ``refresh()``
* ✅ Schema management
* ✅ Context managers

This means BM25Index is a drop-in replacement for FileIndex.

HybridSearcher
--------------

BM25Index works directly with HybridSearcher for combined lexical and semantic search::

    from semlix.bm25 import open_bm25_index
    from semlix.semantic import HybridSearcher, SentenceTransformerProvider
    from semlix.semantic.stores import PgVectorStore

    ix = open_bm25_index("my_index")
    embedder = SentenceTransformerProvider()
    vectors = PgVectorStore("postgresql://localhost/mydb", dimension=384)

    searcher = HybridSearcher(ix, vectors, embedder, alpha=0.5)
    results = searcher.search("query text", limit=10)
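
The ``alpha`` parameter balances the two score sources. The sketch below shows one
common way to do this: min-max normalize each score set, then take a weighted sum.
It assumes ``alpha`` weights the semantic score and ``1 - alpha`` the lexical
score; this document does not state which direction HybridSearcher uses or how it
normalizes, and ``fuse`` is a hypothetical function::

    def min_max(scores):
        """Rescale a {doc_id: score} mapping to the range [0, 1]."""
        if not scores:
            return {}
        low, high = min(scores.values()), max(scores.values())
        if high == low:
            return {doc: 1.0 for doc in scores}
        return {doc: (s - low) / (high - low) for doc, s in scores.items()}

    def fuse(lexical, semantic, alpha=0.5):
        """Rank documents by alpha * semantic + (1 - alpha) * lexical."""
        lex, sem = min_max(lexical), min_max(semantic)
        fused = {
            doc: alpha * sem.get(doc, 0.0) + (1 - alpha) * lex.get(doc, 0.0)
            for doc in set(lex) | set(sem)
        }
        return sorted(fused.items(), key=lambda item: item[1], reverse=True)

    lexical_scores = {"a": 10.0, "b": 5.0}   # e.g. BM25 scores
    semantic_scores = {"b": 0.9, "c": 0.4}   # e.g. cosine similarities
    ranked = fuse(lexical_scores, semantic_scores, alpha=0.7)
    ranked_ids = [doc for doc, _ in ranked]

Normalizing first matters because BM25 scores and vector similarities live on
different scales; without it, one source would dominate regardless of ``alpha``.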

Limitations
===========

Partial Implementation
----------------------

**Segment Management:**

Unlike FileStorage, BM25Index does not expose a direct segment-management API.
Segments are handled internally by ``bm25s``, which is sufficient for most use cases.

**Real-time Updates:**

BM25sStore rebuilds the entire index on updates. For applications requiring
frequent small updates, consider batching updates or using UnifiedIndex.
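
Batching updates can be as simple as buffering documents and writing them in one
pass once enough have accumulated. The following is a minimal sketch under that
assumption; ``UpdateBuffer`` and its ``write_batch`` callback are hypothetical and
not part of semlix::

    class UpdateBuffer:
        """Collect documents and hand them to `write_batch` in groups."""

        def __init__(self, write_batch, max_pending=100):
            self._write_batch = write_batch
            self._max_pending = max_pending
            self._pending = []

        def add(self, doc):
            self._pending.append(doc)
            if len(self._pending) >= self._max_pending:
                self.flush()

        def flush(self):
            # One call here stands for one index rebuild
            if self._pending:
                self._write_batch(self._pending)
                self._pending = []

    rebuilds = []
    buffer = UpdateBuffer(rebuilds.append, max_pending=3)
    for i in range(7):
        buffer.add({"id": str(i)})
    buffer.flush()  # write whatever is left
    batch_sizes = [len(batch) for batch in rebuilds]

Seven single-document updates result in three writes instead of seven. In a real
application the callback would open ``ix.writer()`` and add each buffered document.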

Not Implemented
---------------

The following FileStorage features are not implemented:

* Direct segment access/manipulation
* Custom codecs
* Per-segment optimization
* Incremental updates without rebuild

These features are rarely needed, and for most use cases the performance benefits
of BM25 outweigh these limitations.

Examples
========

Basic Usage
-----------

::

    from semlix.bm25 import create_bm25_index, open_bm25_index
    from semlix.fields import Schema, TEXT, ID
    from semlix.qparser import QueryParser

    # Create
    schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
    ix = create_bm25_index("my_index", schema)

    # Index
    with ix.writer() as writer:
        writer.add_document(id="1", content="Python programming")
        writer.add_document(id="2", content="Database design")

    ix.close()

    # Open and search
    ix = open_bm25_index("my_index")

    with ix.searcher() as searcher:
        qp = QueryParser("content", ix.schema)
        results = searcher.search(qp.parse("python"), limit=10)

        for hit in results:
            print(f"{hit['id']}: {hit.score:.3f}")

With Advanced Features
----------------------

::

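    from semlix.bm25 import (
        create_bm25_index,
        PhraseQuery,
        Facets,
        SortBy
    )
    from semlix.qparser import QueryParser

    # `schema` is assumed to be a Schema defined beforehand with
    # content, category, and date fields
    ix = create_bm25_index("my_index", schema)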

    # ... index documents ...

    with ix.searcher() as searcher:
        # Phrase search
        pq = PhraseQuery("content", ["machine", "learning"])
        results = pq.search(ix, limit=10)

        # Faceting
        facets = Facets(ix)
        qp = QueryParser("content", ix.schema)
        results = searcher.search(qp.parse("python"), limit=100)
        counts = facets.count_by_field(results, "category")

        # Sorting
        sorter = SortBy([("date", True), ("score", True)])
        sorted_results = sorter.sort_results(results)

Performance Comparison
======================

Benchmarks (10K documents, 384-dim vectors):

==================  ============  ============  ==============
Metric              FileStorage   BM25Index     Improvement
==================  ============  ============  ==============
Search Speed        10-100 q/s    1000+ q/s     10-100x
Index Build Time    ~30s          ~5s           6x faster
Memory Usage        300MB         100MB         3x less
Concurrent Queries  Limited       Excellent     Much better
==================  ============  ============  ==============

See Also
========

* :doc:`unified` - Unified index combining BM25 and vector search
* :doc:`semantic` - Semantic search and HybridSearcher
* :doc:`indexing` - General indexing concepts
* :doc:`searching` - Search query syntax