Migration Guide¶

This guide covers migrating indexes between different storage backends in semlix, including upgrading from FileStorage to BM25 or UnifiedIndex.

Overview¶

semlix supports several storage backends:

FileStorage: Traditional Whoosh file-based storage
BM25Index: High-performance BM25 storage (10-100x faster)
UnifiedIndex: Combined BM25 + vector search

Migration tools automate the process of moving data between these backends.

Migration Scenarios¶

FileStorage → BM25Index¶

When to use:

You want 10-100x faster search
You don’t need semantic/vector search
You want lower memory usage

Benefits:

Dramatically faster search (1000+ queries/second)
Lower memory footprint (3x less)
Faster indexing (6x faster)
Same API, drop-in replacement

FileStorage + Vectors → UnifiedIndex¶

When to use:

You’re using HybridSearcher with separate stores
You want unified management of lexical + semantic
You want ACID transactions across both stores
You want automatic embedding generation

Benefits:

Simplified architecture (one index instead of two)
Transactional writes (atomic updates)
Better performance
Enhanced features (faceting on hybrid results, etc.)

NumpyVectorStore → PgVectorStore¶

When to use:

You want better scalability for vectors
You need ACID transactions
You want metadata filtering on vectors
You need backup/recovery tools

Benefits:

PostgreSQL reliability and scalability
HNSW indexing for fast search
JSONB metadata filtering
Professional backup/recovery

Quick Start¶

Simple BM25 Migration¶

from semlix.tools import migrate_to_bm25

migrate_to_bm25(
    source_dir="old_whoosh_index",
    target_dir="new_bm25_index",
    batch_size=1000,
    verbose=True
)

The migration will:

Open the source FileStorage index
Create a new BM25 index with the same schema
Copy all documents with progress tracking
Optimize the new index

Simple Unified Migration¶

from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider

embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

migrate_to_unified(
    source_dir="old_whoosh_index",
    target_dir="new_unified_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="old_vectors.pkl",  # Optional: reuse existing vectors
    batch_size=100,
    verbose=True
)

The migration will:

Open source index and optional vector store
Create new UnifiedIndex
Copy documents and vectors
Generate new embeddings for documents without vectors
Optimize both indexes

Detailed Migration¶

Using IndexMigrator¶

For more control, use the IndexMigrator class:

from semlix.tools import IndexMigrator

migrator = IndexMigrator(verbose=True)

# BM25 migration
migrator.migrate_to_bm25(
    source_dir="old_index",
    target_dir="new_index",
    batch_size=1000
)

# Unified migration
migrator.migrate_to_unified(
    source_dir="old_index",
    target_dir="new_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="vectors.pkl",
    batch_size=100
)

# Vectors-only migration
migrator.migrate_vectors_only(
    source_store_path="vectors.pkl",
    target_connection_string="postgresql://localhost/mydb",
    table_name="my_vectors"
)

Custom Migration¶

For advanced scenarios, write custom migration code:

from semlix.tools import IndexMigrator
from semlix.index import open_dir
from semlix.bm25 import create_bm25_index

# Open source
source = open_dir("old_index")

# Create target
target = create_bm25_index("new_index", source.schema)

# Custom migration with filtering
with source.searcher() as searcher:
    with target.writer() as writer:
        for docnum in range(searcher.reader().doc_count_all()):
            fields = searcher.stored_fields(docnum)

            # Custom logic: only migrate certain documents
            if fields.get("category") in ["important", "archive"]:
                writer.add_document(**fields)

                if docnum % 100 == 0:
                    print(f"Processed {docnum} documents")

# Optimize
target.optimize()

source.close()
target.close()

Migration Strategies¶

Zero-Downtime Migration¶

For production systems, use a dual-write strategy:

Phase 1: Dual Write

from semlix.index import open_dir
from semlix.bm25 import open_bm25_index

# Open both indexes
old_ix = open_dir("old_index")
new_ix = open_bm25_index("new_index")

# Write to both
def add_document(**fields):
    with old_ix.writer() as w1:
        w1.add_document(**fields)

    with new_ix.writer() as w2:
        w2.add_document(**fields)

Phase 2: Migrate Historical Data

# Migrate old data in background
from semlix.tools import migrate_to_bm25

migrate_to_bm25("old_index", "new_index")

Phase 3: Switch Reads

# Change searcher to use new index
# old: searcher = old_ix.searcher()
searcher = new_ix.searcher()

Phase 4: Remove Old Index

After verifying new index works, remove dual writes and old index.

Incremental Migration¶

For very large indexes, migrate in chunks:

from semlix.index import open_dir
from semlix.bm25 import create_bm25_index, open_bm25_index

source = open_dir("huge_index")
target = create_bm25_index("new_index", source.schema)

chunk_size = 10000
offset = 0

with source.searcher() as searcher:
    total = searcher.reader().doc_count_all()

    while offset < total:
        print(f"Migrating documents {offset} to {offset + chunk_size}")

        with target.writer() as writer:
            for docnum in range(offset, min(offset + chunk_size, total)):
                fields = searcher.stored_fields(docnum)
                writer.add_document(**fields)

        offset += chunk_size

        # Optional: backup checkpoint
        target.optimize()

source.close()
target.close()

Testing Migration¶

Always test migration on a copy first:

import shutil
from semlix.tools import migrate_to_bm25

# Copy original index
shutil.copytree("production_index", "test_index")

# Test migration
migrate_to_bm25("test_index", "test_bm25_index")

# Verify document counts
from semlix.index import open_dir
from semlix.bm25 import open_bm25_index

old_ix = open_dir("test_index")
new_ix = open_bm25_index("test_bm25_index")

assert old_ix.doc_count() == new_ix.doc_count()

# Spot check some documents
with old_ix.searcher() as s1, new_ix.searcher() as s2:
    old_doc = s1.stored_fields(0)
    new_doc = s2.stored_fields(0)
    assert old_doc == new_doc

Performance Considerations¶

Migration Speed¶

Typical speeds (10K document index):

BM25 migration: ~5,000 docs/sec
Unified migration (with embeddings): ~100 docs/sec
Vector-only migration: ~10,000 vectors/sec

Factors affecting speed:

Disk I/O speed
Document size
Embedding model speed (for unified migration)
Batch size
Available memory

Optimization¶

Increase batch size for faster migration:

migrate_to_bm25(
    source_dir="old_index",
    target_dir="new_index",
    batch_size=5000  # Default: 1000
)

For unified migration with embeddings:

migrate_to_unified(
    source_dir="old_index",
    target_dir="new_index",
    connection_string=pg_url,
    embedder=embedder,
    batch_size=500  # Larger batches for embedding generation
)

Memory Usage¶

Migration memory usage depends on batch size:

Batch Size	BM25	Unified
100	~50MB	~100MB
1000	~200MB	~500MB
5000	~800MB	~2GB

For memory-constrained systems, use smaller batches.

Compatibility¶

Schema Compatibility¶

The target index must support all field types in the source schema.

Fully Compatible:

ID, TEXT, KEYWORD, NUMERIC, DATETIME, BOOLEAN
All analyzers (StandardAnalyzer, StemmingAnalyzer, etc.)
Stored and indexed fields

Partially Compatible:

Custom field types may need adjustment
Some FileStorage-specific features not available in BM25

Field Mapping¶

All standard semlix fields migrate automatically:

# Source schema
schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True),
    tags=KEYWORD(stored=True),
    price=NUMERIC(stored=True),
    published=DATETIME(stored=True)
)

# Migrates to BM25 with same schema
migrate_to_bm25("old_index", "new_index")

# All fields preserved with same types and analyzers

Data Integrity¶

Verification¶

Always verify migration success:

from semlix.index import open_dir
from semlix.bm25 import open_bm25_index

old_ix = open_dir("old_index")
new_ix = open_bm25_index("new_index")

# Check document count
assert old_ix.doc_count() == new_ix.doc_count(), "Document count mismatch"

# Verify schema
assert old_ix.schema == new_ix.schema, "Schema mismatch"

# Spot check documents
with old_ix.searcher() as s1, new_ix.searcher() as s2:
    for i in range(min(100, old_ix.doc_count())):
        old_doc = s1.stored_fields(i)
        new_doc = s2.stored_fields(i)

        assert old_doc == new_doc, f"Document {i} mismatch"

print("✓ Migration verification passed")

Rollback¶

Keep the original index until migration is verified:

# 1. Migrate to new index
migrate_to_bm25("production_index", "new_bm25_index")

# 2. Test new index thoroughly
test_new_index("new_bm25_index")

# 3. Switch application to new index
deploy_with_new_index()

# 4. Monitor for 24-48 hours

# 5. Only then remove old index
# shutil.rmtree("production_index")  # Wait until confident

Backup¶

Always backup before migration:

import shutil
import datetime

# Create timestamped backup
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
backup_name = f"production_index_backup_{timestamp}"

shutil.copytree("production_index", backup_name)

# Now safe to migrate
migrate_to_bm25("production_index", "new_bm25_index")

Common Issues¶

Schema Mismatch¶

Problem: Source schema has custom field types not supported by target

Solution: Create compatible schema manually:

from semlix.fields import Schema, TEXT, ID
from semlix.bm25 import create_bm25_index

# Create compatible schema
new_schema = Schema(
    id=ID(stored=True),
    content=TEXT(stored=True)  # Simplified from complex field
)

target = create_bm25_index("new_index", new_schema)

# Custom migration with field mapping
# ... map source fields to target fields ...

Memory Errors¶

Problem: Migration runs out of memory

Solutions:

Reduce batch size:

migrate_to_bm25(
    source_dir="old_index",
    target_dir="new_index",
    batch_size=100  # Smaller batches
)

Use incremental migration (see Incremental Migration above)
Increase system memory or swap space

PostgreSQL Connection Errors¶

Problem: “Too many connections” error during unified migration

Solutions:

Increase connection pool size:

from semlix.semantic.stores import PgVectorStore

vector_store = PgVectorStore(
    connection_string=pg_url,
    pool_size=5  # Reduce from default 10
)

Close connections properly (use context managers)
Increase PostgreSQL max_connections setting

Document Count Mismatch¶

Problem: Target index has fewer documents than source

Causes:

Migration was interrupted
Some documents failed to migrate
Filter was applied (in custom migration)

Solutions:

Check migration logs for errors
Re-run migration from scratch
Use verification script to identify missing documents

Examples¶

Complete BM25 Migration¶

from semlix.tools import migrate_to_bm25
from semlix.index import open_dir
from semlix.bm25 import open_bm25_index

print("Starting migration...")

# Migrate
migrate_to_bm25(
    source_dir="whoosh_index",
    target_dir="bm25_index",
    batch_size=1000,
    verbose=True
)

# Verify
old_ix = open_dir("whoosh_index")
new_ix = open_bm25_index("bm25_index")

print(f"Old index: {old_ix.doc_count()} documents")
print(f"New index: {new_ix.doc_count()} documents")

assert old_ix.doc_count() == new_ix.doc_count()
print("✓ Migration successful!")

Complete Unified Migration¶

from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider
from semlix.unified import open_unified_index

# Setup
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
pg_url = "postgresql://localhost/mydb"

print(f"Using embedder: {embedder.model_name}")
print(f"Dimension: {embedder.dimension}")

# Migrate
migrate_to_unified(
    source_dir="whoosh_index",
    target_dir="unified_index",
    connection_string=pg_url,
    embedder=embedder,
    vector_store_path="old_vectors.pkl",
    batch_size=100,
    verbose=True
)

# Test
ix = open_unified_index("unified_index", embedder)

with ix.searcher() as searcher:
    results = searcher.hybrid_search("test query", limit=5)
    print(f"Found {len(results)} results")

print("✓ Unified migration successful!")

Migration with Filtering¶

from semlix.index import open_dir
from semlix.bm25 import create_bm25_index

source = open_dir("all_documents")
target = create_bm25_index("filtered_documents", source.schema)

# Only migrate recent documents
from datetime import datetime, timedelta
cutoff = datetime.now() - timedelta(days=365)

migrated = 0
skipped = 0

with source.searcher() as searcher:
    with target.writer() as writer:
        for docnum in range(searcher.reader().doc_count_all()):
            fields = searcher.stored_fields(docnum)

            # Check date
            if "published" in fields:
                pub_date = fields["published"]
                if isinstance(pub_date, datetime) and pub_date >= cutoff:
                    writer.add_document(**fields)
                    migrated += 1
                else:
                    skipped += 1

            if (migrated + skipped) % 1000 == 0:
                print(f"Processed: {migrated + skipped} "
                      f"(migrated: {migrated}, skipped: {skipped})")

print(f"✓ Filtered migration complete:")
print(f"  Migrated: {migrated}")
print(f"  Skipped: {skipped}")

Best Practices¶

Always backup first
Test on a copy before migrating production data
Verify document counts after migration
Spot check documents to ensure data integrity
Monitor during migration for errors or issues
Keep old index until new one is proven in production
Document your migration process for future reference
Plan for rollback in case of issues

Migration Guide¶

Overview¶

Migration Scenarios¶

FileStorage → BM25Index¶

FileStorage + Vectors → UnifiedIndex¶

NumpyVectorStore → PgVectorStore¶

Quick Start¶

Simple BM25 Migration¶

Simple Unified Migration¶

Detailed Migration¶

Using IndexMigrator¶

Custom Migration¶

Migration Strategies¶

Zero-Downtime Migration¶

Incremental Migration¶

Testing Migration¶

Performance Considerations¶

Migration Speed¶

Optimization¶

Memory Usage¶

Compatibility¶

Schema Compatibility¶

Field Mapping¶

Data Integrity¶

Verification¶

Rollback¶

Backup¶

Common Issues¶

Schema Mismatch¶

Memory Errors¶

PostgreSQL Connection Errors¶

Document Count Mismatch¶

Examples¶

Complete BM25 Migration¶

Complete Unified Migration¶

Migration with Filtering¶

Best Practices¶

See Also¶