Migration Guide
This guide covers migrating indexes between storage backends in semlix, including upgrading from FileStorage to BM25Index or UnifiedIndex.
Overview
semlix supports several storage backends:
FileStorage: Traditional Whoosh file-based storage
BM25Index: High-performance BM25 storage (10-100x faster)
UnifiedIndex: Combined BM25 + vector search
Migration tools automate the process of moving data between these backends.
Migration Scenarios
FileStorage → BM25Index
When to use:
You want 10-100x faster search
You don’t need semantic/vector search
You want lower memory usage
Benefits:
Dramatically faster search (1000+ queries/second)
Lower memory footprint (about one-third of FileStorage's)
Faster indexing (6x faster)
Same API, drop-in replacement
FileStorage + Vectors → UnifiedIndex
When to use:
You’re using HybridSearcher with separate stores
You want unified management of lexical + semantic
You want ACID transactions across both stores
You want automatic embedding generation
Benefits:
Simplified architecture (one index instead of two)
Transactional writes (atomic updates)
Better performance
Enhanced features (faceting on hybrid results, etc.)
NumpyVectorStore → PgVectorStore
When to use:
You want better scalability for vectors
You need ACID transactions
You want metadata filtering on vectors
You need backup/recovery tools
Benefits:
PostgreSQL reliability and scalability
HNSW indexing for fast search
JSONB metadata filtering
Professional backup/recovery
Quick Start
Simple BM25 Migration
from semlix.tools import migrate_to_bm25
migrate_to_bm25(
    source_dir="old_whoosh_index",
    target_dir="new_bm25_index",
    batch_size=1000,
    verbose=True
)
The migration will:
Open the source FileStorage index
Create a new BM25 index with the same schema
Copy all documents with progress tracking
Optimize the new index
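The copy step above works through the source in groups of `batch_size` documents. How `migrate_to_bm25` batches internally is not documented here, so the following is only a minimal sketch of the general pattern. The `batched` helper is hypothetical, and plain dictionaries stand in for real documents.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in documents; a real migration would read stored fields from the source index.
docs = [{"id": str(n)} for n in range(2500)]
batch_sizes = [len(batch) for batch in batched(docs, 1000)]
print(batch_sizes)
```

Only one batch is held in memory at a time, which is why a smaller `batch_size` lowers peak memory at the cost of more write round trips.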
Simple Unified Migration
from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
migrate_to_unified(
    source_dir="old_whoosh_index",
    target_dir="new_unified_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="old_vectors.pkl",  # Optional: reuse existing vectors
    batch_size=100,
    verbose=True
)
The migration will:
Open source index and optional vector store
Create new UnifiedIndex
Copy documents and vectors
Generate new embeddings for documents without vectors
Optimize both indexes
Detailed Migration
Using IndexMigrator
For more control, use the IndexMigrator class:
from semlix.tools import IndexMigrator
from semlix.semantic import SentenceTransformerProvider
migrator = IndexMigrator(verbose=True)
# Embedder for the unified migration (same model as in the Quick Start)
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
# BM25 migration
migrator.migrate_to_bm25(
    source_dir="old_index",
    target_dir="new_index",
    batch_size=1000
)
# Unified migration
migrator.migrate_to_unified(
    source_dir="old_index",
    target_dir="new_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="vectors.pkl",
    batch_size=100
)
# Vectors-only migration
migrator.migrate_vectors_only(
    source_store_path="vectors.pkl",
    target_connection_string="postgresql://localhost/mydb",
    table_name="my_vectors"
)
Custom Migration
For advanced scenarios, write custom migration code:
from semlix.index import open_dir
from semlix.bm25 import create_bm25_index
# Open source
source = open_dir("old_index")
# Create target
target = create_bm25_index("new_index", source.schema)
# Custom migration with filtering
with source.searcher() as searcher:
    with target.writer() as writer:
        for docnum in range(searcher.reader().doc_count_all()):
            fields = searcher.stored_fields(docnum)
            # Custom logic: only migrate certain documents
            if fields.get("category") in ["important", "archive"]:
                writer.add_document(**fields)
            # Report progress every 100 documents
            if docnum % 100 == 0:
                print(f"Processed {docnum} documents")
# Optimize
target.optimize()
source.close()
target.close()
Migration Strategies
Zero-Downtime Migration
For production systems, use a dual-write strategy:
Phase 1: Dual Write
from semlix.index import open_dir
from semlix.bm25 import open_bm25_index
# Open both indexes
old_ix = open_dir("old_index")
new_ix = open_bm25_index("new_index")
# Write to both
def add_document(**fields):
    with old_ix.writer() as w1:
        w1.add_document(**fields)
    with new_ix.writer() as w2:
        w2.add_document(**fields)
Phase 2: Migrate Historical Data
# Migrate old data in background
from semlix.tools import migrate_to_bm25
migrate_to_bm25("old_index", "new_index")
Phase 3: Switch Reads
# Change searcher to use new index
# old: searcher = old_ix.searcher()
searcher = new_ix.searcher()
Phase 4: Remove Old Index
After verifying that the new index works, remove the dual writes and delete the old index.
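The four phases can be pictured as two switches: one that controls where reads go and one that controls whether the old index still receives writes. The sketch below is illustrative only. `InMemoryIndex` and `DualWriteRouter` are hypothetical stand-ins and not part of semlix. A real router would wrap the index objects opened in Phase 1.

```python
from typing import Any, Dict, List


class InMemoryIndex:
    """Stand-in for a real index; it only records the documents it receives."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, Any]] = []

    def add_document(self, **fields: Any) -> None:
        self.docs.append(fields)


class DualWriteRouter:
    """Hypothetical helper that tracks which migration phase is active."""

    def __init__(self, old_index: InMemoryIndex, new_index: InMemoryIndex) -> None:
        self.old_index = old_index
        self.new_index = new_index
        self.read_from_new = False  # Phase 3 flips this to True
        self.dual_write = True      # Phase 4 flips this to False

    def add_document(self, **fields: Any) -> None:
        # The new index always receives writes; the old one only while dual-writing.
        self.new_index.add_document(**fields)
        if self.dual_write:
            self.old_index.add_document(**fields)

    def read_index(self) -> InMemoryIndex:
        return self.new_index if self.read_from_new else self.old_index


old, new = InMemoryIndex(), InMemoryIndex()
router = DualWriteRouter(old, new)
router.add_document(id="1", title="first")   # Phase 1: both indexes receive it
router.read_from_new = True                  # Phase 3: switch reads
router.dual_write = False                    # Phase 4: stop writing to the old index
router.add_document(id="2", title="second")  # only the new index receives it
```

Keeping both switches in one place makes rollback cheap: until Phase 4, setting `read_from_new` back to `False` returns reads to an old index that is still up to date.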
Incremental Migration
For very large indexes, migrate in chunks:
from semlix.index import open_dir
from semlix.bm25 import create_bm25_index
source = open_dir("huge_index")
target = create_bm25_index("new_index", source.schema)
chunk_size = 10000
offset = 0
with source.searcher() as searcher:
    total = searcher.reader().doc_count_all()
    while offset < total:
        end = min(offset + chunk_size, total)
        print(f"Migrating documents {offset} to {end}")
        # Use a separate writer for each chunk
        with target.writer() as writer:
            for docnum in range(offset, end):
                fields = searcher.stored_fields(docnum)
                writer.add_document(**fields)
        offset = end
        # Optional: backup checkpoint
target.optimize()
source.close()
target.close()
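The "backup checkpoint" comment in the loop above leaves the mechanism open. One possible approach, sketched here under the assumption that a small JSON file is an acceptable place to record progress, is to save the offset after each chunk so that an interrupted run can resume. The file name and the helpers `load_offset`, `save_offset`, and `chunk_ranges` are hypothetical.

```python
import json
import os
import tempfile
from typing import Iterator, Tuple


def load_offset(path: str) -> int:
    """Return the saved offset, or 0 when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path, "r", encoding="utf-8") as handle:
        return int(json.load(handle)["offset"])


def save_offset(path: str, offset: int) -> None:
    """Write the checkpoint via a temp file so a crash cannot leave it half-written."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump({"offset": offset}, handle)
    os.replace(tmp_path, path)


def chunk_ranges(start: int, total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (begin, end) pairs covering start..total in chunk_size steps."""
    for begin in range(start, total, chunk_size):
        yield begin, min(begin + chunk_size, total)


checkpoint = os.path.join(tempfile.mkdtemp(), "migration_checkpoint.json")
total_docs = 25_000  # illustrative; the loop above reads this from the source index

# First run: process one chunk, then pretend the process was interrupted.
for begin, end in chunk_ranges(load_offset(checkpoint), total_docs, 10_000):
    # ... copy documents begin..end-1 here ...
    save_offset(checkpoint, end)
    break

# Second run: resume from the saved offset instead of starting over.
remaining = list(chunk_ranges(load_offset(checkpoint), total_docs, 10_000))
print(remaining)
```

The checkpoint should be saved only after the chunk's writer has finished, otherwise a resumed run could skip documents that were never written.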
Testing Migration
Always test migration on a copy first:
import shutil
from semlix.index import open_dir
from semlix.bm25 import open_bm25_index
from semlix.tools import migrate_to_bm25
# Copy original index
shutil.copytree("production_index", "test_index")
# Test migration
migrate_to_bm25("test_index", "test_bm25_index")
# Verify document counts
old_ix = open_dir("test_index")
new_ix = open_bm25_index("test_bm25_index")
assert old_ix.doc_count() == new_ix.doc_count()
# Spot check some documents
with old_ix.searcher() as s1, new_ix.searcher() as s2:
    old_doc = s1.stored_fields(0)
    new_doc = s2.stored_fields(0)
    assert old_doc == new_doc
Performance Considerations
Migration Speed
Typical speeds (10K document index):
BM25 migration: ~5,000 docs/sec
Unified migration (with embeddings): ~100 docs/sec
Vector-only migration: ~10,000 vectors/sec
Factors affecting speed:
Disk I/O speed
Document size
Embedding model speed (for unified migration)
Batch size
Available memory
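The throughput figures above can be turned into a rough duration estimate before scheduling a migration. This is back-of-the-envelope arithmetic only: it assumes constant throughput, which the factors listed above will usually prevent. The one-million-document index size is an illustrative assumption.

```python
def estimate_minutes(doc_count: int, docs_per_second: float) -> float:
    """Estimated wall-clock minutes at a constant throughput."""
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return doc_count / docs_per_second / 60


# Throughput figures quoted above; real numbers depend on hardware and data.
TYPICAL_DOCS_PER_SECOND = {
    "bm25": 5_000,
    "unified_with_embeddings": 100,
    "vectors_only": 10_000,
}

doc_count = 1_000_000  # illustrative index size
estimates = {
    name: round(estimate_minutes(doc_count, rate), 1)
    for name, rate in TYPICAL_DOCS_PER_SECOND.items()
}
print(estimates)
```

At these rates a BM25 migration of one million documents takes a few minutes, while a unified migration that generates every embedding takes close to three hours, so reusing existing vectors where possible is worth the effort.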
Optimization
Increase batch size for faster migration:
migrate_to_bm25(
    source_dir="old_index",
    target_dir="new_index",
    batch_size=5000  # Default: 1000
)
For unified migration with embeddings:
migrate_to_unified(
    source_dir="old_index",
    target_dir="new_index",
    connection_string=pg_url,
    embedder=embedder,
    batch_size=500  # Larger batches for embedding generation
)
Memory Usage
Migration memory usage depends on batch size:
| Batch Size | BM25 | Unified |
|---|---|---|
| 100 | ~50MB | ~100MB |
| 1000 | ~200MB | ~500MB |
| 5000 | ~800MB | ~2GB |
For memory-constrained systems, use smaller batches.
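As a rough aid for choosing a batch size, the table can be encoded as a lookup and queried with a memory budget. The figures are the approximate ones from the table (with ~2GB taken as 2000 MB), and `largest_batch_within` is a hypothetical helper, not a semlix function.

```python
# Approximate peak memory in MB per batch size, copied from the table above.
MEMORY_MB = {
    "bm25": {100: 50, 1000: 200, 5000: 800},
    "unified": {100: 100, 1000: 500, 5000: 2000},
}


def largest_batch_within(budget_mb: int, backend: str) -> int:
    """Return the largest tabulated batch size whose estimate fits the budget."""
    fitting = [size for size, mb in MEMORY_MB[backend].items() if mb <= budget_mb]
    if not fitting:
        raise ValueError(f"no tabulated batch size fits in {budget_mb} MB")
    return max(fitting)


bm25_batch = largest_batch_within(256, "bm25")
unified_batch = largest_batch_within(256, "unified")
print(bm25_batch, unified_batch)
```

With a 256 MB budget this picks a batch size of 1000 for BM25 but only 100 for a unified migration, since the unified path needs more memory at every batch size.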
Compatibility
Schema Compatibility
The target index must support all field types in the source schema.
Fully Compatible:
ID, TEXT, KEYWORD, NUMERIC, DATETIME, BOOLEAN
All analyzers (StandardAnalyzer, StemmingAnalyzer, etc.)
Stored and indexed fields
Partially Compatible:
Custom field types may need adjustment
Some FileStorage-specific features are not available in BM25Index
Field Mapping
All standard semlix fields migrate automatically:
from semlix.fields import Schema, ID, TEXT, KEYWORD, NUMERIC, DATETIME
from semlix.analysis import StandardAnalyzer  # module path assumed to mirror Whoosh's layout
from semlix.tools import migrate_to_bm25
# Source schema
schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True),
    tags=KEYWORD(stored=True),
    price=NUMERIC(stored=True),
    published=DATETIME(stored=True)
)
# Migrates to BM25 with same schema
migrate_to_bm25("old_index", "new_index")
# All fields preserved with same types and analyzers
Data Integrity
Verification
Always verify migration success:
from semlix.index import open_dir
from semlix.bm25 import open_bm25_index
old_ix = open_dir("old_index")
new_ix = open_bm25_index("new_index")
# Check document count
assert old_ix.doc_count() == new_ix.doc_count(), "Document count mismatch"
# Verify schema
assert old_ix.schema == new_ix.schema, "Schema mismatch"
# Spot check documents
with old_ix.searcher() as s1, new_ix.searcher() as s2:
    for i in range(min(100, old_ix.doc_count())):
        old_doc = s1.stored_fields(i)
        new_doc = s2.stored_fields(i)
        assert old_doc == new_doc, f"Document {i} mismatch"
print("✓ Migration verification passed")
Rollback
Keep the original index until migration is verified:
import shutil
from semlix.tools import migrate_to_bm25
# 1. Migrate to new index
migrate_to_bm25("production_index", "new_bm25_index")
# 2. Test new index thoroughly (test_new_index is a placeholder for your own checks)
test_new_index("new_bm25_index")
# 3. Switch application to new index (deploy_with_new_index is a placeholder for your own deployment step)
deploy_with_new_index()
# 4. Monitor for 24-48 hours
# 5. Only then remove old index
# shutil.rmtree("production_index") # Wait until confident
Backup
Always back up before migrating:
import shutil
import datetime
from semlix.tools import migrate_to_bm25
# Create timestamped backup
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
backup_name = f"production_index_backup_{timestamp}"
shutil.copytree("production_index", backup_name)
# Now safe to migrate
migrate_to_bm25("production_index", "new_bm25_index")
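A backup is only useful if the copy is complete. One way to check it, sketched below with the standard library only, is to compare SHA-256 checksums of every file in the original and backup directories. The `tree_checksums` helper and the throwaway demo directory are illustrative assumptions. Real index directories would be passed in instead.

```python
import hashlib
import os
import shutil
import tempfile
from typing import Dict


def tree_checksums(root: str) -> Dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    checksums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            hasher = hashlib.sha256()
            with open(full_path, "rb") as handle:
                # Read in 1 MiB chunks so large index files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    hasher.update(chunk)
            checksums[os.path.relpath(full_path, root)] = hasher.hexdigest()
    return checksums


# Demo on a throwaway directory standing in for "production_index".
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "production_index")
backup = os.path.join(workdir, "production_index_backup")
os.makedirs(original)
with open(os.path.join(original, "segment.dat"), "wb") as handle:
    handle.write(b"example index bytes")

shutil.copytree(original, backup)
backup_ok = tree_checksums(original) == tree_checksums(backup)
print(backup_ok)
```

The comparison is only meaningful if nothing writes to the original index between the copy and the check.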
Common Issues
Schema Mismatch
Problem: Source schema has custom field types not supported by the target
Solution: Create a compatible schema manually:
from semlix.fields import Schema, TEXT, ID
from semlix.bm25 import create_bm25_index
# Create compatible schema
new_schema = Schema(
id=ID(stored=True),
content=TEXT(stored=True) # Simplified from complex field
)
target = create_bm25_index("new_index", new_schema)
# Custom migration with field mapping
# ... map source fields to target fields ...
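The field-mapping step left open above could take many forms. A minimal sketch, assuming the mapping is a plain rename and that unmapped source fields are simply dropped, follows. `FIELD_MAP`, the field names in it, and `map_fields` are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict

# Hypothetical mapping: source field name -> target field name.
FIELD_MAP = {"doc_id": "id", "body": "content"}


def map_fields(source_fields: Dict[str, Any], field_map: Dict[str, str]) -> Dict[str, Any]:
    """Keep only mapped fields, renamed to their target names."""
    return {
        target_name: source_fields[source_name]
        for source_name, target_name in field_map.items()
        if source_name in source_fields
    }


# "geo" stands in for a custom field type that the target schema does not define.
source_fields = {"doc_id": "42", "body": "hello", "geo": (1.0, 2.0)}
target_fields = map_fields(source_fields, FIELD_MAP)
print(target_fields)
```

In a custom migration loop, the mapped dictionary would then be passed on with `writer.add_document(**target_fields)`.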
Memory Errors
Problem: Migration runs out of memory
Solutions:
Reduce batch size:
migrate_to_bm25(
    source_dir="old_index",
    target_dir="new_index",
    batch_size=100  # Smaller batches
)
Use incremental migration (see Incremental Migration above)
Increase system memory or swap space
PostgreSQL Connection Errors
Problem: “Too many connections” error during unified migration
Solutions:
Reduce connection pool size:
from semlix.semantic.stores import PgVectorStore
vector_store = PgVectorStore(
    connection_string=pg_url,
    pool_size=5  # Reduce from default 10
)
Close connections properly (use context managers)
Increase PostgreSQL max_connections setting
Document Count Mismatch
Problem: Target index has fewer documents than source
Causes:
Migration was interrupted
Some documents failed to migrate
Filter was applied (in custom migration)
Solutions:
Check migration logs for errors
Re-run migration from scratch
Use verification script to identify missing documents
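The verification script earlier in this guide compares counts and spot-checks documents, but it does not say which documents are missing. A minimal sketch for that, assuming every document carries a unique stored `id` field, is a set difference over the IDs. Plain dictionaries stand in for stored fields here, and `missing_ids` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, Set


def missing_ids(
    source_docs: Iterable[Dict[str, Any]],
    target_docs: Iterable[Dict[str, Any]],
    key: str = "id",
) -> Set[Any]:
    """Return key values present in the source but absent from the target."""
    target_keys = {doc[key] for doc in target_docs if key in doc}
    source_keys = {doc[key] for doc in source_docs if key in doc}
    return source_keys - target_keys


# Stand-in data; in practice these would be the stored fields of each index.
source_docs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
target_docs = [{"id": "a"}, {"id": "c"}]
lost = missing_ids(source_docs, target_docs)
print(sorted(lost))
```

The resulting IDs can then be migrated on their own instead of re-running the whole migration.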
Examples
Complete BM25 Migration
from semlix.tools import migrate_to_bm25
from semlix.index import open_dir
from semlix.bm25 import open_bm25_index
print("Starting migration...")
# Migrate
migrate_to_bm25(
source_dir="whoosh_index",
target_dir="bm25_index",
batch_size=1000,
verbose=True
)
# Verify
old_ix = open_dir("whoosh_index")
new_ix = open_bm25_index("bm25_index")
print(f"Old index: {old_ix.doc_count()} documents")
print(f"New index: {new_ix.doc_count()} documents")
assert old_ix.doc_count() == new_ix.doc_count()
print("✓ Migration successful!")
Complete Unified Migration
from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider
from semlix.unified import open_unified_index
# Setup
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
pg_url = "postgresql://localhost/mydb"
print(f"Using embedder: {embedder.model_name}")
print(f"Dimension: {embedder.dimension}")
# Migrate
migrate_to_unified(
source_dir="whoosh_index",
target_dir="unified_index",
connection_string=pg_url,
embedder=embedder,
vector_store_path="old_vectors.pkl",
batch_size=100,
verbose=True
)
# Test
ix = open_unified_index("unified_index", embedder)
with ix.searcher() as searcher:
    results = searcher.hybrid_search("test query", limit=5)
    print(f"Found {len(results)} results")
print("✓ Unified migration successful!")
Migration with Filtering
from datetime import datetime, timedelta
from semlix.index import open_dir
from semlix.bm25 import create_bm25_index
source = open_dir("all_documents")
target = create_bm25_index("filtered_documents", source.schema)
# Only migrate recent documents
cutoff = datetime.now() - timedelta(days=365)
migrated = 0
skipped = 0
with source.searcher() as searcher:
    with target.writer() as writer:
        for docnum in range(searcher.reader().doc_count_all()):
            fields = searcher.stored_fields(docnum)
            # Check date; documents without a valid "published" value count as skipped
            pub_date = fields.get("published")
            if isinstance(pub_date, datetime) and pub_date >= cutoff:
                writer.add_document(**fields)
                migrated += 1
            else:
                skipped += 1
            if (migrated + skipped) % 1000 == 0:
                print(f"Processed: {migrated + skipped} "
                      f"(migrated: {migrated}, skipped: {skipped})")
print("✓ Filtered migration complete:")
print(f" Migrated: {migrated}")
print(f" Skipped: {skipped}")
Best Practices
Always back up first
Test on a copy before migrating production data
Verify document counts after migration
Spot check documents to ensure data integrity
Monitor during migration for errors or issues
Keep old index until new one is proven in production
Document your migration process for future reference
Plan for rollback in case of issues
See Also
BM25 Index - BM25 index documentation
Unified Index - Unified index documentation
How to index documents - General indexing concepts