==================== Migration Guide ==================== This guide covers migrating indexes between different storage backends in semlix, including upgrading from FileStorage to BM25 or UnifiedIndex. Overview ======== semlix supports several storage backends: * **FileStorage**: Traditional Whoosh file-based storage * **BM25Index**: High-performance BM25 storage (10-100x faster) * **UnifiedIndex**: Combined BM25 + vector search Migration tools automate the process of moving data between these backends. Migration Scenarios =================== FileStorage → BM25Index ------------------------ **When to use:** * You want 10-100x faster search * You don't need semantic/vector search * You want lower memory usage **Benefits:** * Dramatically faster search (1000+ queries/second) * Lower memory footprint (3x less) * Faster indexing (6x faster) * Same API, drop-in replacement FileStorage + Vectors → UnifiedIndex ------------------------------------- **When to use:** * You're using HybridSearcher with separate stores * You want unified management of lexical + semantic * You want ACID transactions across both stores * You want automatic embedding generation **Benefits:** * Simplified architecture (one index instead of two) * Transactional writes (atomic updates) * Better performance * Enhanced features (faceting on hybrid results, etc.) NumpyVectorStore → PgVectorStore --------------------------------- **When to use:** * You want better scalability for vectors * You need ACID transactions * You want metadata filtering on vectors * You need backup/recovery tools **Benefits:** * PostgreSQL reliability and scalability * HNSW indexing for fast search * JSONB metadata filtering * Professional backup/recovery Quick Start =========== Simple BM25 Migration ---------------------- :: from semlix.tools import migrate_to_bm25 migrate_to_bm25( source_dir="old_whoosh_index", target_dir="new_bm25_index", batch_size=1000, verbose=True ) The migration will: 1. Open the source FileStorage index 2. Create a new BM25 index with the same schema 3. Copy all documents with progress tracking 4. Optimize the new index Simple Unified Migration ------------------------- :: from semlix.tools import migrate_to_unified from semlix.semantic import SentenceTransformerProvider embedder = SentenceTransformerProvider("all-MiniLM-L6-v2") migrate_to_unified( source_dir="old_whoosh_index", target_dir="new_unified_index", connection_string="postgresql://localhost/mydb", embedder=embedder, vector_store_path="old_vectors.pkl", # Optional: reuse existing vectors batch_size=100, verbose=True ) The migration will: 1. Open source index and optional vector store 2. Create new UnifiedIndex 3. Copy documents and vectors 4. Generate new embeddings for documents without vectors 5. Optimize both indexes Detailed Migration ================== Using IndexMigrator ------------------- For more control, use the IndexMigrator class:: from semlix.tools import IndexMigrator migrator = IndexMigrator(verbose=True) # BM25 migration migrator.migrate_to_bm25( source_dir="old_index", target_dir="new_index", batch_size=1000 ) # Unified migration migrator.migrate_to_unified( source_dir="old_index", target_dir="new_index", connection_string="postgresql://localhost/mydb", embedder=embedder, vector_store_path="vectors.pkl", batch_size=100 ) # Vectors-only migration migrator.migrate_vectors_only( source_store_path="vectors.pkl", target_connection_string="postgresql://localhost/mydb", table_name="my_vectors" ) Custom Migration ---------------- For advanced scenarios, write custom migration code:: from semlix.tools import IndexMigrator from semlix.index import open_dir from semlix.bm25 import create_bm25_index # Open source source = open_dir("old_index") # Create target target = create_bm25_index("new_index", source.schema) # Custom migration with filtering with source.searcher() as searcher: with target.writer() as writer: for docnum in range(searcher.reader().doc_count_all()): fields = searcher.stored_fields(docnum) # Custom logic: only migrate certain documents if fields.get("category") in ["important", "archive"]: writer.add_document(**fields) if docnum % 100 == 0: print(f"Processed {docnum} documents") # Optimize target.optimize() source.close() target.close() Migration Strategies ==================== Zero-Downtime Migration ----------------------- For production systems, use a dual-write strategy: **Phase 1: Dual Write** :: from semlix.index import open_dir from semlix.bm25 import open_bm25_index # Open both indexes old_ix = open_dir("old_index") new_ix = open_bm25_index("new_index") # Write to both def add_document(**fields): with old_ix.writer() as w1: w1.add_document(**fields) with new_ix.writer() as w2: w2.add_document(**fields) **Phase 2: Migrate Historical Data** :: # Migrate old data in background from semlix.tools import migrate_to_bm25 migrate_to_bm25("old_index", "new_index") **Phase 3: Switch Reads** :: # Change searcher to use new index # old: searcher = old_ix.searcher() searcher = new_ix.searcher() **Phase 4: Remove Old Index** After verifying new index works, remove dual writes and old index. Incremental Migration --------------------- For very large indexes, migrate in chunks:: from semlix.index import open_dir from semlix.bm25 import create_bm25_index, open_bm25_index source = open_dir("huge_index") target = create_bm25_index("new_index", source.schema) chunk_size = 10000 offset = 0 with source.searcher() as searcher: total = searcher.reader().doc_count_all() while offset < total: print(f"Migrating documents {offset} to {offset + chunk_size}") with target.writer() as writer: for docnum in range(offset, min(offset + chunk_size, total)): fields = searcher.stored_fields(docnum) writer.add_document(**fields) offset += chunk_size # Optional: backup checkpoint target.optimize() source.close() target.close() Testing Migration ----------------- Always test migration on a copy first:: import shutil from semlix.tools import migrate_to_bm25 # Copy original index shutil.copytree("production_index", "test_index") # Test migration migrate_to_bm25("test_index", "test_bm25_index") # Verify document counts from semlix.index import open_dir from semlix.bm25 import open_bm25_index old_ix = open_dir("test_index") new_ix = open_bm25_index("test_bm25_index") assert old_ix.doc_count() == new_ix.doc_count() # Spot check some documents with old_ix.searcher() as s1, new_ix.searcher() as s2: old_doc = s1.stored_fields(0) new_doc = s2.stored_fields(0) assert old_doc == new_doc Performance Considerations ========================== Migration Speed --------------- **Typical speeds (10K document index):** * BM25 migration: ~5,000 docs/sec * Unified migration (with embeddings): ~100 docs/sec * Vector-only migration: ~10,000 vectors/sec **Factors affecting speed:** * Disk I/O speed * Document size * Embedding model speed (for unified migration) * Batch size * Available memory Optimization ------------ **Increase batch size for faster migration:** :: migrate_to_bm25( source_dir="old_index", target_dir="new_index", batch_size=5000 # Default: 1000 ) **For unified migration with embeddings:** :: migrate_to_unified( source_dir="old_index", target_dir="new_index", connection_string=pg_url, embedder=embedder, batch_size=500 # Larger batches for embedding generation ) Memory Usage ------------ Migration memory usage depends on batch size: ================== ========== ============ Batch Size BM25 Unified ================== ========== ============ 100 ~50MB ~100MB 1000 ~200MB ~500MB 5000 ~800MB ~2GB ================== ========== ============ For memory-constrained systems, use smaller batches. Compatibility ============= Schema Compatibility -------------------- The target index must support all field types in the source schema. **Fully Compatible:** * ID, TEXT, KEYWORD, NUMERIC, DATETIME, BOOLEAN * All analyzers (StandardAnalyzer, StemmingAnalyzer, etc.) * Stored and indexed fields **Partially Compatible:** * Custom field types may need adjustment * Some FileStorage-specific features not available in BM25 Field Mapping ------------- All standard semlix fields migrate automatically:: # Source schema schema = Schema( id=ID(stored=True), title=TEXT(stored=True, analyzer=StandardAnalyzer()), content=TEXT(stored=True), tags=KEYWORD(stored=True), price=NUMERIC(stored=True), published=DATETIME(stored=True) ) # Migrates to BM25 with same schema migrate_to_bm25("old_index", "new_index") # All fields preserved with same types and analyzers Data Integrity ============== Verification ------------ Always verify migration success:: from semlix.index import open_dir from semlix.bm25 import open_bm25_index old_ix = open_dir("old_index") new_ix = open_bm25_index("new_index") # Check document count assert old_ix.doc_count() == new_ix.doc_count(), "Document count mismatch" # Verify schema assert old_ix.schema == new_ix.schema, "Schema mismatch" # Spot check documents with old_ix.searcher() as s1, new_ix.searcher() as s2: for i in range(min(100, old_ix.doc_count())): old_doc = s1.stored_fields(i) new_doc = s2.stored_fields(i) assert old_doc == new_doc, f"Document {i} mismatch" print("✓ Migration verification passed") Rollback -------- Keep the original index until migration is verified:: # 1. Migrate to new index migrate_to_bm25("production_index", "new_bm25_index") # 2. Test new index thoroughly test_new_index("new_bm25_index") # 3. Switch application to new index deploy_with_new_index() # 4. Monitor for 24-48 hours # 5. Only then remove old index # shutil.rmtree("production_index") # Wait until confident Backup ------ Always backup before migration:: import shutil import datetime # Create timestamped backup timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S") backup_name = f"production_index_backup_{timestamp}" shutil.copytree("production_index", backup_name) # Now safe to migrate migrate_to_bm25("production_index", "new_bm25_index") Common Issues ============= Schema Mismatch --------------- **Problem:** Source schema has custom field types not supported by target **Solution:** Create compatible schema manually:: from semlix.fields import Schema, TEXT, ID from semlix.bm25 import create_bm25_index # Create compatible schema new_schema = Schema( id=ID(stored=True), content=TEXT(stored=True) # Simplified from complex field ) target = create_bm25_index("new_index", new_schema) # Custom migration with field mapping # ... map source fields to target fields ... Memory Errors ------------- **Problem:** Migration runs out of memory **Solutions:** 1. Reduce batch size:: migrate_to_bm25( source_dir="old_index", target_dir="new_index", batch_size=100 # Smaller batches ) 2. Use incremental migration (see Incremental Migration above) 3. Increase system memory or swap space PostgreSQL Connection Errors ----------------------------- **Problem:** "Too many connections" error during unified migration **Solutions:** 1. Increase connection pool size:: from semlix.semantic.stores import PgVectorStore vector_store = PgVectorStore( connection_string=pg_url, pool_size=5 # Reduce from default 10 ) 2. Close connections properly (use context managers) 3. Increase PostgreSQL max_connections setting Document Count Mismatch ------------------------ **Problem:** Target index has fewer documents than source **Causes:** * Migration was interrupted * Some documents failed to migrate * Filter was applied (in custom migration) **Solutions:** 1. Check migration logs for errors 2. Re-run migration from scratch 3. Use verification script to identify missing documents Examples ======== Complete BM25 Migration ----------------------- :: from semlix.tools import migrate_to_bm25 from semlix.index import open_dir from semlix.bm25 import open_bm25_index print("Starting migration...") # Migrate migrate_to_bm25( source_dir="whoosh_index", target_dir="bm25_index", batch_size=1000, verbose=True ) # Verify old_ix = open_dir("whoosh_index") new_ix = open_bm25_index("bm25_index") print(f"Old index: {old_ix.doc_count()} documents") print(f"New index: {new_ix.doc_count()} documents") assert old_ix.doc_count() == new_ix.doc_count() print("✓ Migration successful!") Complete Unified Migration --------------------------- :: from semlix.tools import migrate_to_unified from semlix.semantic import SentenceTransformerProvider from semlix.unified import open_unified_index # Setup embedder = SentenceTransformerProvider("all-MiniLM-L6-v2") pg_url = "postgresql://localhost/mydb" print(f"Using embedder: {embedder.model_name}") print(f"Dimension: {embedder.dimension}") # Migrate migrate_to_unified( source_dir="whoosh_index", target_dir="unified_index", connection_string=pg_url, embedder=embedder, vector_store_path="old_vectors.pkl", batch_size=100, verbose=True ) # Test ix = open_unified_index("unified_index", embedder) with ix.searcher() as searcher: results = searcher.hybrid_search("test query", limit=5) print(f"Found {len(results)} results") print("✓ Unified migration successful!") Migration with Filtering ------------------------- :: from semlix.index import open_dir from semlix.bm25 import create_bm25_index source = open_dir("all_documents") target = create_bm25_index("filtered_documents", source.schema) # Only migrate recent documents from datetime import datetime, timedelta cutoff = datetime.now() - timedelta(days=365) migrated = 0 skipped = 0 with source.searcher() as searcher: with target.writer() as writer: for docnum in range(searcher.reader().doc_count_all()): fields = searcher.stored_fields(docnum) # Check date if "published" in fields: pub_date = fields["published"] if isinstance(pub_date, datetime) and pub_date >= cutoff: writer.add_document(**fields) migrated += 1 else: skipped += 1 if (migrated + skipped) % 1000 == 0: print(f"Processed: {migrated + skipped} " f"(migrated: {migrated}, skipped: {skipped})") print(f"✓ Filtered migration complete:") print(f" Migrated: {migrated}") print(f" Skipped: {skipped}") Best Practices ============== 1. **Always backup first** 2. **Test on a copy** before migrating production data 3. **Verify document counts** after migration 4. **Spot check documents** to ensure data integrity 5. **Monitor during migration** for errors or issues 6. **Keep old index** until new one is proven in production 7. **Document your migration** process for future reference 8. **Plan for rollback** in case of issues See Also ======== * :doc:`bm25` - BM25 index documentation * :doc:`unified` - Unified index documentation * :doc:`indexing` - General indexing concepts