================ Unified Index ================ UnifiedIndex combines high-performance BM25 lexical search with pgvector semantic search in a single, unified interface. This provides the best of both worlds: fast keyword matching and semantic understanding. Overview ======== UnifiedIndex automatically manages both a BM25 index for lexical search and a PostgreSQL vector store for semantic search. Documents are indexed in both stores simultaneously, and searches can leverage either or both approaches. **Key Benefits:** * **Hybrid search out-of-the-box**: No manual setup required * **Transactional writes**: Atomic updates across both stores * **Automatic embeddings**: Generates vectors during indexing * **Enhanced features**: Faceting, sorting, and phrase queries on hybrid results * **Production-ready**: ACID transactions, scalable PostgreSQL backend Quick Start =========== Creating a Unified Index ------------------------- :: from semlix.unified import create_unified_index from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME from semlix.semantic import SentenceTransformerProvider from semlix.analysis import StandardAnalyzer # Define schema schema = Schema( id=ID(stored=True), title=TEXT(stored=True, analyzer=StandardAnalyzer()), content=TEXT(stored=True, analyzer=StandardAnalyzer()), author=KEYWORD(stored=True), category=KEYWORD(stored=True), published=DATETIME(stored=True) ) # Create embedding provider embedder = SentenceTransformerProvider("all-MiniLM-L6-v2") # Create unified index ix = create_unified_index( index_dir="my_unified_index", schema=schema, connection_string="postgresql://localhost/mydb", embedder=embedder ) Prerequisites ------------- UnifiedIndex requires: 1. **PostgreSQL with pgvector extension**:: # Install extension CREATE EXTENSION vector; 2. **Python packages**:: pip install bm25s sentence-transformers psycopg2-binary pgvector Indexing Documents ------------------ Use the unified writer to add documents to both indexes:: with ix.writer() as writer: writer.add_document( id="1", title="Introduction to Machine Learning", content="Machine learning enables systems to learn from data...", author="Alice", category="ai", published="2024-01-15" ) writer.add_document( id="2", title="Python Programming Guide", content="Learn Python programming best practices...", author="Bob", category="programming", published="2024-02-20" ) The writer automatically: 1. Indexes documents in the BM25 index 2. Generates embeddings for specified fields 3. Stores vectors in PostgreSQL 4. Commits both atomically Searching ========= Hybrid Search ------------- Combine lexical and semantic search (default):: with ix.searcher() as searcher: results = searcher.hybrid_search( "machine learning algorithms", limit=10, alpha=0.5 # 0=all lexical, 1=all semantic ) for r in results: print(f"{r.stored_fields['title']}") print(f" Combined: {r.score:.3f}") print(f" Lexical: {r.lexical_score:.3f}") print(f" Semantic: {r.semantic_score:.3f}") **Alpha Parameter:** * ``alpha=0.0``: Pure lexical search (BM25) * ``alpha=0.5``: Balanced hybrid (recommended) * ``alpha=1.0``: Pure semantic search (vector) Lexical-Only Search ------------------- Use BM25 only for exact keyword matching:: with ix.searcher() as searcher: results = searcher.lexical_only("python programming", limit=10) Semantic-Only Search -------------------- Use vectors only for conceptual queries:: with ix.searcher() as searcher: # Finds conceptually similar docs even without keyword overlap results = searcher.semantic_only("AI and neural networks", limit=10) Components ========== UnifiedIndex ------------ The main index class combining BM25 and vector search. **Constructor Parameters:** * ``index_dir``: Directory for the index * ``schema``: Field schema * ``connection_string``: PostgreSQL connection URL * ``embedder``: Embedding provider * ``id_field``: Field containing document IDs (default: "id") * ``searchable_fields``: Fields to use for embeddings (default: all TEXT fields) **Methods:** * ``writer(**kwargs)``: Returns UnifiedWriter * ``searcher(**kwargs)``: Returns UnifiedSearcher * ``reader(**kwargs)``: Returns BM25Reader * ``optimize()``: Optimizes both indexes * ``doc_count()``: Returns document count * ``close()``: Closes both stores UnifiedWriter ------------- Handles transactional writes across both stores:: with ix.writer() as writer: # Add document (indexed in both BM25 and vectors) writer.add_document(id="1", content="Document text") # Update document (deletes old, adds new in both stores) writer.update_document(id="1", content="Updated text") # Delete document (removes from both stores) writer.delete_document(id="1") # Delete by query from semlix.qparser import QueryParser qp = QueryParser("content", ix.schema) query = qp.parse("obsolete") writer.delete_by_query(query) **Transaction Guarantees:** * Writes are atomic across both stores * If vector storage fails, BM25 changes roll back * Automatic embedding generation * Configurable batch processing UnifiedSearcher --------------- Enhanced searcher with hybrid search capabilities:: with ix.searcher() as searcher: # Hybrid search results = searcher.hybrid_search("query", alpha=0.5) # With facets results, facets = searcher.search_with_facets( "python", facet_fields=["category", "author"], limit=100 ) # Phrase search results = searcher.phrase_search( "content", "machine learning", slop=0 ) # Sorted search results = searcher.search_sorted( "python", sort_by=[("published", True), ("score", True)], limit=10 ) **Methods:** * ``hybrid_search(...)``: Combined lexical + semantic * ``lexical_only(...)``: BM25 only * ``semantic_only(...)``: Vector only * ``search_with_facets(...)``: Hybrid search with aggregations * ``phrase_search(...)``: Exact phrase matching * ``sort_results(...)``: Sort existing results * ``search_sorted(...)``: Search with custom sorting Advanced Features ================= Faceted Hybrid Search --------------------- Combine hybrid search with faceting:: with ix.searcher() as searcher: results, facets = searcher.search_with_facets( "machine learning", facet_fields=["category", "author", "year"], limit=100, facet_limit=10, alpha=0.5 ) # Access results for r in results[:10]: print(r.stored_fields['title']) # Access facets print("Categories:", facets["category"]) # {"ai": 45, "programming": 32, "database": 12} print("Authors:", facets["author"]) # {"Alice": 23, "Bob": 18, "Charlie": 15} Phrase Queries -------------- Find exact phrases in hybrid results:: with ix.searcher() as searcher: # Exact phrase results = searcher.phrase_search( field="content", phrase="machine learning", slop=0, limit=10 ) # With slop (allows words in between) results = searcher.phrase_search( field="content", phrase="machine learning", slop=2, # "machine X Y learning" matches limit=10 ) Sorted Hybrid Search -------------------- Sort hybrid results by custom criteria:: with ix.searcher() as searcher: # Sort by date (newest first), then by relevance score results = searcher.search_sorted( "python programming", sort_by=[ ("published", True), # Descending ("score", True) # Descending ], limit=20, alpha=0.5 ) for r in results: doc = r.stored_fields print(f"{doc['title']} - {doc['published']}") Configuration ============= Embedding Provider ------------------ Choose an embedding model based on your needs:: from semlix.semantic import SentenceTransformerProvider # Fast and lightweight (384-dim) embedder = SentenceTransformerProvider("all-MiniLM-L6-v2") # Better quality (768-dim) embedder = SentenceTransformerProvider("all-mpnet-base-v2") # Multilingual embedder = SentenceTransformerProvider("paraphrase-multilingual-MiniLM-L12-v2") Vector Store Configuration -------------------------- Configure PostgreSQL vector storage:: from semlix.semantic.stores import PgVectorStore vector_store = PgVectorStore( connection_string="postgresql://localhost/mydb", dimension=384, distance_metric="cosine", # or "l2", "inner_product" pool_size=10 ) # Create HNSW index for fast search vector_store.create_index( index_type="hnsw", m=16, # HNSW parameter ef_construction=64 # HNSW parameter ) Searchable Fields ----------------- Control which fields are used for embeddings:: ix = create_unified_index( index_dir="my_index", schema=schema, connection_string=pg_url, embedder=embedder, searchable_fields=["title", "content"] # Only these fields ) By default, all TEXT fields are used for embedding generation. Fusion Methods -------------- Choose how to combine lexical and semantic scores:: from semlix.semantic.fusion import FusionMethod with ix.searcher() as searcher: results = searcher.hybrid_search( "query", fusion_method=FusionMethod.RRF, # Reciprocal Rank Fusion alpha=0.5 ) **Available Methods:** * ``RRF`` (Reciprocal Rank Fusion): Recommended, parameter-free * ``LINEAR``: Weighted linear combination * ``DBSF`` (Distribution-Based Score Fusion): Normalizes score distributions * ``RELATIVE_SCORE``: Relative scoring normalization Migration ========= From FileStorage + NumpyVectorStore ------------------------------------ Migrate existing indexes to UnifiedIndex:: from semlix.tools import migrate_to_unified from semlix.semantic import SentenceTransformerProvider embedder = SentenceTransformerProvider() migrate_to_unified( source_dir="old_whoosh_index", target_dir="new_unified_index", connection_string="postgresql://localhost/mydb", embedder=embedder, vector_store_path="old_vectors.pkl", # Reuse existing vectors batch_size=100 ) **Migration Process:** 1. Opens source index and vector store 2. Creates new UnifiedIndex 3. Migrates documents with embeddings 4. Reuses existing vectors when available 5. Generates new vectors for missing documents 6. Optimizes both indexes From BM25Index -------------- Add vector search to existing BM25 index:: from semlix.bm25 import open_bm25_index from semlix.unified import UnifiedIndex from semlix.semantic import SentenceTransformerProvider from semlix.semantic.stores import PgVectorStore # Open existing BM25 index bm25_ix = open_bm25_index("my_bm25_index") # Create vector store embedder = SentenceTransformerProvider() vector_store = PgVectorStore( "postgresql://localhost/mydb", dimension=embedder.dimension ) # Generate embeddings for existing documents docs = [] with bm25_ix.reader() as reader: for doc in reader.iter_docs(): docs.append(doc) # Extract text and generate embeddings texts = [doc.get("content", "") for doc in docs] doc_ids = [doc.get("id", str(i)) for i, doc in enumerate(docs)] embeddings = embedder.encode(texts) # Add to vector store vector_store.add(doc_ids, embeddings) # Create unified index unified_ix = UnifiedIndex( index_dir="unified_index", schema=bm25_ix.schema, connection_string="postgresql://localhost/mydb", embedder=embedder, bm25_index=bm25_ix, vector_store=vector_store ) Performance =========== Search Performance ------------------ **Hybrid Search:** * 500+ queries/second (10K documents) * ~5-10ms latency (p50) * Scales well with document count **Lexical-Only:** * 1000+ queries/second * ~1-2ms latency **Semantic-Only:** * ~100 queries/second (with HNSW index) * ~10-20ms latency Indexing Performance -------------------- **With Embedding Generation:** * ~100 documents/second * Depends on embedding model speed * Can batch for better throughput **Optimization:** Use batch processing for bulk indexing:: batch_size = 100 batch = [] with ix.writer() as writer: for doc in documents: batch.append(doc) if len(batch) >= batch_size: for doc_fields in batch: writer.add_document(**doc_fields) batch = [] Memory Usage ------------ ================== ======== ========== Component 10K docs 100K docs ================== ======== ========== BM25 Index 100MB 500MB Vector Store (PG) 40MB 400MB Total (approx) 140MB 900MB ================== ======== ========== Disk Usage ---------- ================== ======== ========== Component 10K docs 100K docs ================== ======== ========== BM25 Index 50MB 250MB PostgreSQL (total) 100MB 800MB Total (approx) 150MB 1050MB ================== ======== ========== Examples ======== Basic Hybrid Search ------------------- :: from semlix.unified import create_unified_index from semlix.fields import Schema, TEXT, ID from semlix.semantic import SentenceTransformerProvider schema = Schema(id=ID(stored=True), content=TEXT(stored=True)) embedder = SentenceTransformerProvider() ix = create_unified_index( "my_index", schema, "postgresql://localhost/mydb", embedder ) # Index with ix.writer() as writer: writer.add_document( id="1", content="Python is a programming language" ) writer.add_document( id="2", content="Machine learning uses neural networks" ) # Search with ix.searcher() as searcher: # Hybrid: finds both keyword and semantic matches results = searcher.hybrid_search("coding in python", limit=10) Complete Example with All Features ----------------------------------- :: from semlix.unified import create_unified_index from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME from semlix.semantic import SentenceTransformerProvider schema = Schema( id=ID(stored=True), title=TEXT(stored=True), content=TEXT(stored=True), category=KEYWORD(stored=True), published=DATETIME(stored=True) ) embedder = SentenceTransformerProvider("all-MiniLM-L6-v2") ix = create_unified_index( "my_index", schema, "postgresql://localhost/mydb", embedder ) # Index documents with ix.writer() as writer: writer.add_document( id="1", title="AI Basics", content="Introduction to artificial intelligence...", category="ai", published="2024-01-15" ) # ... more documents ... # Search with all features with ix.searcher() as searcher: # Hybrid search with facets results, facets = searcher.search_with_facets( "artificial intelligence", facet_fields=["category"], limit=50, alpha=0.5 ) # Sort by date sorted_results = searcher.sort_results( results, [("published", True)] ) # Phrase search phrase_results = searcher.phrase_search( "content", "machine learning" ) Best Practices ============== 1. **Choose appropriate alpha:** * Use ``alpha=0.3-0.5`` for balanced search * Use ``alpha=0.0`` for exact keyword matching * Use ``alpha=0.8-1.0`` for conceptual/semantic queries 2. **Batch indexing for performance:** * Index in batches of 100-1000 documents * Commit once per batch, not per document 3. **Create HNSW index for vectors:** * Essential for good semantic search performance * Create after bulk indexing:: ix.optimize() # Optimizes both BM25 and vector indexes 4. **Choose embedding model wisely:** * Start with ``all-MiniLM-L6-v2`` (fast, good quality) * Upgrade to ``all-mpnet-base-v2`` if quality matters more than speed * Use multilingual models only if needed 5. **Monitor PostgreSQL:** * Regular VACUUM ANALYZE * Monitor connection pool usage * Consider replication for high availability Troubleshooting =============== Slow Semantic Search -------------------- **Problem:** Vector search is slow (>100ms per query) **Solutions:** 1. Create HNSW index:: ix.vector_store.create_index(index_type="hnsw") 2. Tune HNSW parameters:: ix.vector_store.create_index( index_type="hnsw", m=32, # Higher = better quality, slower build ef_construction=128 # Higher = better quality, slower build ) Memory Issues ------------- **Problem:** High memory usage during indexing **Solutions:** 1. Use smaller batches 2. Enable memory mapping for BM25:: from semlix.stores import BM25sStore store = BM25sStore.load(index_dir, mmap=True) 3. Reduce connection pool size Connection Pool Exhausted -------------------------- **Problem:** PostgreSQL connection errors **Solutions:** 1. Increase pool size:: vector_store = PgVectorStore( connection_string=pg_url, pool_size=50 # Increase from default 10 ) 2. Close searchers when done 3. Use context managers (``with`` statements) See Also ======== * :doc:`bm25` - BM25 index documentation * :doc:`semantic` - Semantic search and HybridSearcher * :doc:`indexing` - General indexing concepts * :doc:`searching` - Search and query syntax