================
BM25 Index
================

semlix now includes a high-performance BM25-based index implementation that provides 10-100x faster search compared to traditional FileStorage, while maintaining full compatibility with the existing Index API.

Overview
========

The BM25 module provides a complete alternative to FileStorage using the `bm25s` library for ultra-fast lexical search. It implements the full semlix Index protocol, making it a drop-in replacement with significant performance improvements.

**Key Benefits:**

* **10-100x faster search**: 1000+ queries/second sustained performance
* **Lower memory usage**: 3x less memory than FileStorage
* **Full compatibility**: Implements complete Index protocol
* **Advanced features**: Phrase queries, faceting, sorting, field caching
* **Easy migration**: Automated tools for upgrading from FileStorage

Quick Start
===========

Creating a BM25 Index
----------------------

Basic index creation::

    from semlix.bm25 import create_bm25_index
    from semlix.fields import Schema, TEXT, ID, KEYWORD
    from semlix.analysis import StandardAnalyzer

    schema = Schema(
        id=ID(stored=True),
        title=TEXT(stored=True, analyzer=StandardAnalyzer()),
        content=TEXT(stored=True, analyzer=StandardAnalyzer()),
        category=KEYWORD(stored=True)
    )

    ix = create_bm25_index("my_bm25_index", schema)

Indexing Documents
------------------

Use the standard writer interface::

    with ix.writer() as writer:
        writer.add_document(
            id="1",
            title="Introduction to Python",
            content="Python is a high-level programming language...",
            category="tutorial"
        )
        writer.add_document(
            id="2",
            title="Advanced Python Techniques",
            content="Learn decorators, generators, and metaclasses...",
            category="advanced"
        )

Searching
---------

Use the standard searcher interface::

    from semlix.qparser import QueryParser

    with ix.searcher() as searcher:
        qp = QueryParser("content", ix.schema)
        query = qp.parse("python programming")
        results = searcher.search(query, limit=10)

        for hit in results:
            print(f"{hit['title']}: {hit.score:.3f}")

Opening an Existing Index
--------------------------

::

    from semlix.bm25 import open_bm25_index

    ix = open_bm25_index("my_bm25_index")

Components
==========

BM25Index
---------

The main index class that implements the complete semlix Index protocol.

**Key Methods:**

* ``writer(**kwargs)``: Returns a BM25Writer for indexing
* ``searcher(**kwargs)``: Returns a BM25Searcher for searching
* ``reader(**kwargs)``: Returns a BM25Reader for document access
* ``optimize()``: Rebuilds index for optimal performance
* ``doc_count()``: Returns number of indexed documents
* ``close()``: Closes the index and frees resources

**Properties:**

* ``schema``: The index schema
* ``index_dir``: Directory containing index files

BM25Writer
----------

Handles document indexing operations::

    with ix.writer() as writer:
        # Add new document
        writer.add_document(id="1", content="Document text")

        # Update existing document
        writer.update_document(id="1", content="Updated text")

        # Delete document
        writer.delete_document(id="1")

        # Delete by query
        from semlix.qparser import QueryParser
        qp = QueryParser("content", ix.schema)
        query = qp.parse("obsolete")
        writer.delete_by_query(query)

The writer supports context managers for automatic commit/rollback.

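
The commit/rollback behaviour of a context-managed writer can be illustrated with a small stand-alone sketch. ``SketchWriter`` below is a hypothetical stand-in, not semlix's ``BM25Writer``; it assumes the common convention that leaving the ``with`` block normally commits the pending documents, while leaving it through an exception discards them.

```python
class SketchWriter:
    """Hypothetical stand-in for a writer; not semlix's BM25Writer."""

    def __init__(self, committed):
        self.committed = committed  # list acting as the "index"
        self.pending = []           # documents buffered until commit

    def add_document(self, **fields):
        self.pending.append(fields)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []

    def cancel(self):
        self.pending = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Commit on a clean exit, discard pending documents on error.
        if exc_type is None:
            self.commit()
        else:
            self.cancel()
        return False  # never swallow the caller's exception


index = []

# Clean exit: the document is committed.
with SketchWriter(index) as writer:
    writer.add_document(id="1", content="Document text")

# Exit through an exception: the pending document is rolled back.
try:
    with SketchWriter(index) as writer:
        writer.add_document(id="2", content="Never committed")
        raise ValueError("indexing failed")
except ValueError:
    pass

print(index)
```

Only the first document ends up in ``index``, which is the behaviour to rely on when wrapping several writes in one ``with`` block.
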
BM25Reader
----------

Provides read access to indexed documents::

    with ix.reader() as reader:
        # Get document count
        count = reader.doc_count()

        # Get stored fields by document number
        fields = reader.stored_fields(0)

        # Get document number by ID
        docnum = reader.document_number(id="1")

        # Iterate all documents
        for doc in reader.iter_docs():
            print(doc)

BM25Searcher
------------

Executes searches and retrieves results::

    with ix.searcher() as searcher:
        # Basic search
        results = searcher.search(query, limit=10)

        # Paginated search
        results = searcher.search_page(query, pagenum=2, pagelen=10)

        # Get stored fields
        fields = searcher.stored_fields(docnum)

        # Find document by ID
        docnum = searcher.document_number(id="1")

The searcher is fully compatible with QueryParser and HybridSearcher.

Advanced Features
=================

Phrase Queries
--------------

Search for exact phrases with optional word distance (slop)::

    from semlix.bm25 import PhraseQuery

    # Exact phrase
    phrase_query = PhraseQuery(
        field="content",
        words=["machine", "learning"],
        slop=0
    )
    results = phrase_query.search(ix, limit=10)

    # With slop (allows words in between)
    phrase_query = PhraseQuery(
        field="content",
        words=["machine", "learning"],
        slop=2  # Allows up to 2 words between
    )

Faceting
--------

Compute aggregations over search results::

    from semlix.bm25 import Facets

    facets = Facets(ix)

    with ix.searcher() as searcher:
        results = searcher.search(query, limit=100)

        # Count by category
        category_counts = facets.count_by_field(results, "category")
        # {"tutorial": 45, "advanced": 32, "reference": 23}

        # Numeric range facets
        ranges = [(0, 100), (100, 500), (500, 1000)]
        range_counts = facets.range_facet(results, "price", ranges)

        # Date facets
        date_counts = facets.date_facet(results, "published", gap="month")

Sorting
-------

Sort results by multiple fields::

    from semlix.bm25 import SortBy

    # Sort by date descending, then score
    sorter = SortBy([("published", True), ("score", True)])
    sorted_results = sorter.sort_results(results)

    # Convenience methods
    sorted_by_field = SortBy.by_field(results, "title")
    sorted_by_score = SortBy.by_score(results, reverse=True)

Field Caching
-------------

Cache frequently accessed field values for better performance::

    from semlix.bm25 import FieldCache

    cache = FieldCache(ix, max_size=1000)

    # Cache a field for all documents
    cache.cache_field("title")

    # Get cached value (very fast)
    title = cache.get_cached("doc123", "title")

    # Invalidate cache when documents change
    cache.invalidate("doc123")  # Single document
    cache.invalidate()          # All documents

Configuration
=============

BM25 Parameters
---------------

You can tune BM25 scoring parameters::

    from semlix.stores import BM25sStore

    store = BM25sStore.create(
        index_dir="my_index",
        method="lucene",  # or "robertson", "atire", "bm25l", "bm25+"
        k1=1.5,           # Term frequency saturation (default: 1.5)
        b=0.75,           # Length normalization (default: 0.75)
        delta=0.5         # BM25+ delta parameter (default: 0.5)
    )

**BM25 Variants:**

* ``lucene``: Lucene's BM25 implementation (default, recommended)
* ``robertson``: Robertson's original BM25
* ``atire``: ATIRE variant
* ``bm25l``: BM25L with better handling of long documents
* ``bm25+``: BM25+ with additional tuning parameter

Analyzers
---------

BM25Index works with all semlix analyzers::

    from semlix.analysis import StandardAnalyzer, StemmingAnalyzer, LanguageAnalyzer

    # Standard analyzer (tokenize, lowercase, stopwords)
    schema = Schema(
        content=TEXT(analyzer=StandardAnalyzer())
    )

    # With stemming
    schema = Schema(
        content=TEXT(analyzer=StemmingAnalyzer())
    )

    # Language-specific
    schema = Schema(
        content=TEXT(analyzer=LanguageAnalyzer("spanish"))
    )

Performance Tuning
==================

Indexing Performance
--------------------

**Batch Size:** Add documents in batches for best performance::

    with ix.writer() as writer:
        batch = []
        for doc in documents:
            batch.append(doc)
            if len(batch) >= 1000:
                for doc_fields in batch:
                    writer.add_document(**doc_fields)
                batch = []

        # Write the final partial batch left over after the loop
        for doc_fields in batch:
            writer.add_document(**doc_fields)

**Optimization:** Rebuild the index after bulk
operations::

    ix.optimize()  # Rebuilds index for optimal performance

Search Performance
------------------

**Memory Mapping:** For large indexes, use memory-mapped files::

    from semlix.stores import BM25sStore

    # When loading
    store = BM25sStore.load(index_dir, mmap=True)

This reduces memory usage and improves cache efficiency.

**Field Caching:** Cache frequently accessed fields::

    from semlix.bm25 import FieldCache

    cache = FieldCache(ix, max_size=10000)
    cache.cache_field("title")
    cache.cache_field("category")

Migration
=========

From FileStorage
----------------

Migrate an existing FileStorage index to BM25::

    from semlix.tools import migrate_to_bm25

    migrate_to_bm25(
        source_dir="old_whoosh_index",
        target_dir="new_bm25_index",
        batch_size=1000
    )

The migration process:

1. Opens the source index
2. Creates a new BM25 index with the same schema
3. Copies all documents with progress tracking
4. Optimizes the new index

**Custom Migration:** For more control, use IndexMigrator::

    from semlix.tools import IndexMigrator
    from semlix.index import open_dir
    from semlix.bm25 import create_bm25_index

    migrator = IndexMigrator(verbose=True)

    source = open_dir("old_index")
    target = create_bm25_index("new_index", source.schema)

    with source.searcher() as searcher:
        with target.writer() as writer:
            for docnum in range(searcher.reader().doc_count_all()):
                fields = searcher.stored_fields(docnum)

                # Optional: filter documents during migration.
                # should_migrate is a predicate you supply yourself.
                if should_migrate(fields):
                    writer.add_document(**fields)

Compatibility
=============

Index Protocol
--------------

BM25Index implements the complete semlix Index protocol:

* ✅ ``writer()`` / ``reader()`` / ``searcher()``
* ✅ ``optimize()`` / ``doc_count()`` / ``is_empty()``
* ✅ ``add_field()`` / ``remove_field()``
* ✅ ``latest_generation()`` / ``refresh()``
* ✅ Schema management
* ✅ Context managers

This means BM25Index is a drop-in replacement for FileIndex.

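
In practice, "drop-in replacement" means that calling code depends only on the methods listed above, not on the concrete index class. The sketch below expresses that idea with ``typing.Protocol``. ``IndexLike``, ``FakeIndex`` and ``count_documents`` are hypothetical names used for illustration only; semlix may define its Index protocol differently.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class IndexLike(Protocol):
    """Hypothetical subset of the index methods listed above."""

    def writer(self, **kwargs: Any) -> Any: ...
    def reader(self, **kwargs: Any) -> Any: ...
    def searcher(self, **kwargs: Any) -> Any: ...
    def optimize(self) -> None: ...
    def doc_count(self) -> int: ...
    def is_empty(self) -> bool: ...


class FakeIndex:
    """Minimal stand-in used only to exercise the check below."""

    def __init__(self, docs):
        self._docs = list(docs)

    def writer(self, **kwargs):
        return None

    def reader(self, **kwargs):
        return None

    def searcher(self, **kwargs):
        return None

    def optimize(self):
        pass

    def doc_count(self):
        return len(self._docs)

    def is_empty(self):
        return not self._docs


def count_documents(ix: IndexLike) -> int:
    # Accepts any object that exposes the protocol's methods,
    # regardless of which index class it is.
    if not isinstance(ix, IndexLike):
        raise TypeError("object does not provide the expected index methods")
    return ix.doc_count()


fake = FakeIndex([{"id": "1"}, {"id": "2"}])
print(count_documents(fake))
```

Code written against such an interface should not need changes when a FileIndex is swapped for a BM25Index, provided it avoids the features listed under Limitations.
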
HybridSearcher
--------------

Works directly with HybridSearcher for semantic search::

    from semlix.bm25 import open_bm25_index
    from semlix.semantic import HybridSearcher, SentenceTransformerProvider
    from semlix.semantic.stores import PgVectorStore

    ix = open_bm25_index("my_index")
    embedder = SentenceTransformerProvider()
    vectors = PgVectorStore("postgresql://localhost/mydb", dimension=384)

    searcher = HybridSearcher(ix, vectors, embedder, alpha=0.5)
    results = searcher.search("query text", limit=10)

Limitations
===========

Partial Implementation
----------------------

**Segment Management:** Unlike FileStorage, BM25Index does not expose a direct segment-management API. Segments are handled internally by bm25s. This is sufficient for most use cases.

**Real-time Updates:** BM25sStore rebuilds the entire index on updates. For applications requiring frequent small updates, consider batching updates or using UnifiedIndex.

Not Implemented
---------------

The following FileStorage features are not implemented:

* Direct segment access/manipulation
* Custom codecs
* Per-segment optimization
* Incremental updates without rebuild

These features are rarely needed, and the performance benefits of BM25 far outweigh these limitations for most use cases.

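
Because every update triggers a rebuild, one way to follow the batching advice above is to buffer updates in application code and flush them through a single writer session. ``BufferedUpdates`` is a hypothetical helper, not part of semlix. It assumes only that the index exposes ``writer()`` as a context manager with ``update_document(**fields)``, as shown in the BM25Writer section. ``CountingIndex`` is a stub so that the sketch runs without semlix installed.

```python
from contextlib import contextmanager


class BufferedUpdates:
    """Hypothetical helper: collect updates, write them in one commit."""

    def __init__(self, index, max_pending=1000):
        self.index = index
        self.max_pending = max_pending
        self.pending = []

    def update(self, **fields):
        self.pending.append(fields)
        if len(self.pending) >= self.max_pending:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # One writer session per batch, so one commit per batch.
        with self.index.writer() as writer:
            for fields in self.pending:
                writer.update_document(**fields)
        self.pending = []


class CountingIndex:
    """Stub that counts commits; stands in for a real index."""

    def __init__(self):
        self.commits = 0
        self.docs = {}

    @contextmanager
    def writer(self):
        index = self

        class _Writer:
            def update_document(self, **fields):
                index.docs[fields["id"]] = fields

        yield _Writer()
        self.commits += 1  # reached only if the block did not raise


ix = CountingIndex()
buffer = BufferedUpdates(ix, max_pending=3)
for n in range(7):
    buffer.update(id=str(n), content=f"document {n}")
buffer.flush()  # write the final partial batch

print(ix.commits, len(ix.docs))
```

Seven updates with ``max_pending=3`` result in three commits instead of seven. The right batch size depends on how stale the application can allow search results to be.
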
Examples
========

Basic Usage
-----------

::

    from semlix.bm25 import create_bm25_index, open_bm25_index
    from semlix.fields import Schema, TEXT, ID
    from semlix.qparser import QueryParser

    # Create
    schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
    ix = create_bm25_index("my_index", schema)

    # Index
    with ix.writer() as writer:
        writer.add_document(id="1", content="Python programming")
        writer.add_document(id="2", content="Database design")

    ix.close()

    # Open and search
    ix = open_bm25_index("my_index")
    with ix.searcher() as searcher:
        qp = QueryParser("content", ix.schema)
        results = searcher.search(qp.parse("python"), limit=10)

        for hit in results:
            print(f"{hit['id']}: {hit.score:.3f}")

With Advanced Features
----------------------

::

    from semlix.bm25 import (
        create_bm25_index, PhraseQuery, Facets, SortBy
    )
    from semlix.qparser import QueryParser

    # schema is defined as in the Quick Start example
    ix = create_bm25_index("my_index", schema)

    # ... index documents ...

    with ix.searcher() as searcher:
        # Phrase search
        pq = PhraseQuery("content", ["machine", "learning"])
        results = pq.search(ix, limit=10)

        # Faceting
        facets = Facets(ix)
        qp = QueryParser("content", ix.schema)
        results = searcher.search(qp.parse("python"), limit=100)
        counts = facets.count_by_field(results, "category")

        # Sorting
        sorter = SortBy([("date", True), ("score", True)])
        sorted_results = sorter.sort_results(results)

Performance Comparison
======================

Benchmarks (10K documents, 384-dim vectors):

================== ============ ============ ==============
Metric             FileStorage  BM25Index    Improvement
================== ============ ============ ==============
Search Speed       10-100 q/s   1000+ q/s    10-100x
Index Build Time   ~30s         ~5s          6x faster
Memory Usage       300MB        100MB        3x less
Concurrent Queries Limited      Excellent    Much better
================== ============ ============ ==============

See Also
========

* :doc:`unified` - Unified index combining BM25 and vector search
* :doc:`semantic` - Semantic search and HybridSearcher
* :doc:`indexing` - General indexing concepts
* :doc:`searching` - Search query syntax