Unified Index¶

UnifiedIndex combines BM25 lexical search with pgvector semantic search behind a single interface, so one index serves both exact keyword matching and meaning-based retrieval.

Overview¶

UnifiedIndex automatically manages both a BM25 index for lexical search and a PostgreSQL vector store for semantic search. Documents are indexed in both stores simultaneously, and searches can leverage either or both approaches.

Key Benefits:

Hybrid search out-of-the-box: No manual setup required
Transactional writes: Atomic updates across both stores
Automatic embeddings: Generates vectors during indexing
Enhanced features: Faceting, sorting, and phrase queries on hybrid results
Production-ready: ACID transactions, scalable PostgreSQL backend

Quick Start¶

Creating a Unified Index¶

from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME
from semlix.semantic import SentenceTransformerProvider
from semlix.analysis import StandardAnalyzer

# Define schema
schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True, analyzer=StandardAnalyzer()),
    author=KEYWORD(stored=True),
    category=KEYWORD(stored=True),
    published=DATETIME(stored=True)
)

# Create embedding provider
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

# Create unified index
ix = create_unified_index(
    index_dir="my_unified_index",
    schema=schema,
    connection_string="postgresql://localhost/mydb",
    embedder=embedder
)

Prerequisites¶

UnifiedIndex requires:

PostgreSQL with pgvector extension:

-- Enable the pgvector extension (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;

Python packages:

pip install bm25s sentence-transformers psycopg2-binary pgvector

Indexing Documents¶

Use the unified writer to add documents to both indexes:

with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="Introduction to Machine Learning",
        content="Machine learning enables systems to learn from data...",
        author="Alice",
        category="ai",
        published="2024-01-15"
    )
    writer.add_document(
        id="2",
        title="Python Programming Guide",
        content="Learn Python programming best practices...",
        author="Bob",
        category="programming",
        published="2024-02-20"
    )

The writer automatically:

Indexes documents in the BM25 index
Generates embeddings for specified fields
Stores vectors in PostgreSQL
Commits both atomically

Searching¶

Hybrid Search¶

Combine lexical and semantic search (default):

with ix.searcher() as searcher:
    results = searcher.hybrid_search(
        "machine learning algorithms",
        limit=10,
        alpha=0.5  # 0=all lexical, 1=all semantic
    )

    for r in results:
        print(f"{r.stored_fields['title']}")
        print(f"  Combined: {r.score:.3f}")
        print(f"  Lexical: {r.lexical_score:.3f}")
        print(f"  Semantic: {r.semantic_score:.3f}")

Alpha Parameter:

alpha=0.0: Pure lexical search (BM25)
alpha=0.5: Balanced hybrid (recommended)
alpha=1.0: Pure semantic search (vector)
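To make the role of alpha concrete, here is a minimal sketch of a weighted linear blend. It assumes both score sets are min-max normalized to [0, 1] before mixing. The names `min_max` and `blend_scores` are illustrative and not part of semlix, and the library's actual fusion may work differently (see Fusion Methods).

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def blend_scores(lexical, semantic, alpha=0.5):
    """Combine scores as (1 - alpha) * lexical + alpha * semantic."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    lex, sem = min_max(lexical), min_max(semantic)
    combined = {}
    for doc_id in set(lex) | set(sem):
        # A document missing from one result list contributes 0 for that side.
        combined[doc_id] = (1 - alpha) * lex.get(doc_id, 0.0) + alpha * sem.get(doc_id, 0.0)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores: BM25 is unbounded, cosine similarity is not.
bm25 = {"doc1": 12.0, "doc2": 4.0, "doc3": 8.0}
cosine = {"doc1": 0.2, "doc2": 0.9, "doc3": 0.55}

print(blend_scores(bm25, cosine, alpha=0.0)[0][0])  # doc1: top lexical hit
print(blend_scores(bm25, cosine, alpha=1.0)[0][0])  # doc2: top semantic hit
```

Normalizing first matters because raw BM25 scores and cosine similarities live on different scales; without it, alpha would not behave like a true mixing weight.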

Lexical-Only Search¶

Use BM25 only for exact keyword matching:

with ix.searcher() as searcher:
    results = searcher.lexical_only("python programming", limit=10)

Semantic-Only Search¶

Use vectors only for conceptual queries:

with ix.searcher() as searcher:
    # Finds conceptually similar docs even without keyword overlap
    results = searcher.semantic_only("AI and neural networks", limit=10)

Components¶

UnifiedIndex¶

The main index class combining BM25 and vector search.

Constructor Parameters:

index_dir: Directory for the index
schema: Field schema
connection_string: PostgreSQL connection URL
embedder: Embedding provider
id_field: Field containing document IDs (default: "id")
searchable_fields: Fields to use for embeddings (default: all TEXT fields)

Methods:

writer(**kwargs): Returns UnifiedWriter
searcher(**kwargs): Returns UnifiedSearcher
reader(**kwargs): Returns BM25Reader
optimize(): Optimizes both indexes
doc_count(): Returns document count
close(): Closes both stores

UnifiedWriter¶

Handles transactional writes across both stores:

from semlix.qparser import QueryParser

with ix.writer() as writer:
    # Add document (indexed in both BM25 and vectors)
    writer.add_document(id="1", content="Document text")

    # Update document (deletes old, adds new in both stores)
    writer.update_document(id="1", content="Updated text")

    # Delete document (removes from both stores)
    writer.delete_document(id="1")

    # Delete every document matching a parsed query
    qp = QueryParser("content", ix.schema)
    query = qp.parse("obsolete")
    writer.delete_by_query(query)

Transaction Guarantees:

Writes are atomic across both stores
If vector storage fails, BM25 changes roll back
Automatic embedding generation
Configurable batch processing
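The rollback guarantee can be pictured as a compensating write. The sketch below is not semlix's implementation: `InMemoryStore` and `add_to_both` are hypothetical stand-ins that only illustrate the idea of undoing the lexical write when the vector write fails.

```python
class InMemoryStore:
    """Stand-in for one backing store (hypothetical, for illustration only)."""

    def __init__(self, fail_on_add=False):
        self.docs = {}
        self.fail_on_add = fail_on_add

    def add(self, doc_id, payload):
        if self.fail_on_add:
            raise RuntimeError("store unavailable")
        self.docs[doc_id] = payload

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)


def add_to_both(lexical, vectors, doc_id, text, embedding):
    """Write to the lexical store first, then undo that write if the vector write fails."""
    lexical.add(doc_id, text)
    try:
        vectors.add(doc_id, embedding)
    except Exception:
        lexical.delete(doc_id)  # compensate so the stores stay in sync
        raise


lexical, vectors = InMemoryStore(), InMemoryStore(fail_on_add=True)
try:
    add_to_both(lexical, vectors, "1", "Document text", [0.1, 0.2])
except RuntimeError:
    pass

print(lexical.docs)  # {}: the lexical write was rolled back
```

Re-raising after the compensation keeps the failure visible to the caller, which is usually what a `with` block around the writer relies on.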

UnifiedSearcher¶

Enhanced searcher with hybrid search capabilities:

with ix.searcher() as searcher:
    # Hybrid search
    results = searcher.hybrid_search("query", alpha=0.5)

    # With facets
    results, facets = searcher.search_with_facets(
        "python",
        facet_fields=["category", "author"],
        limit=100
    )

    # Phrase search
    results = searcher.phrase_search(
        "content",
        "machine learning",
        slop=0
    )

    # Sorted search
    results = searcher.search_sorted(
        "python",
        sort_by=[("published", True), ("score", True)],  # (field, descending)
        limit=10
    )

Methods:

hybrid_search(...): Combined lexical + semantic
lexical_only(...): BM25 only
semantic_only(...): Vector only
search_with_facets(...): Hybrid search with aggregations
phrase_search(...): Exact phrase matching
sort_results(...): Sort existing results
search_sorted(...): Search with custom sorting

Advanced Features¶

Faceted Hybrid Search¶

Combine hybrid search with faceting:

with ix.searcher() as searcher:
    results, facets = searcher.search_with_facets(
        "machine learning",
        facet_fields=["category", "author"],  # fields must exist in the schema
        limit=100,
        facet_limit=10,
        alpha=0.5
    )

    # Access results
    for r in results[:10]:
        print(r.stored_fields['title'])

    # Access facets
    print("Categories:", facets["category"])
    # {"ai": 45, "programming": 32, "database": 12}

    print("Authors:", facets["author"])
    # {"Alice": 23, "Bob": 18, "Charlie": 15}
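The facet dictionaries above are value counts over the returned hits. As a rough sketch of that aggregation, assuming facets are computed from stored fields of the retrieved results (the function `count_facets` is illustrative, not a semlix API):

```python
from collections import Counter


def count_facets(results, facet_fields, facet_limit=10):
    """Count field values across results; keep the `facet_limit` most common per field."""
    facets = {}
    for field in facet_fields:
        counts = Counter(
            doc[field] for doc in results if doc.get(field) is not None
        )
        facets[field] = dict(counts.most_common(facet_limit))
    return facets


# Hypothetical stored fields for three hits.
hits = [
    {"title": "Intro to ML", "category": "ai", "author": "Alice"},
    {"title": "Neural Nets", "category": "ai", "author": "Bob"},
    {"title": "Python Tips", "category": "programming", "author": "Alice"},
]

print(count_facets(hits, ["category", "author"]))
# {'category': {'ai': 2, 'programming': 1}, 'author': {'Alice': 2, 'Bob': 1}}
```

If facets are indeed counted over retrieved hits, a larger `limit` gives more representative counts, which may be why the example requests 100 results but prints only 10.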

Phrase Queries¶

Find exact phrases in hybrid results:

with ix.searcher() as searcher:
    # Exact phrase
    results = searcher.phrase_search(
        field="content",
        phrase="machine learning",
        slop=0,
        limit=10
    )

    # With slop (allows words in between)
    results = searcher.phrase_search(
        field="content",
        phrase="machine learning",
        slop=2,  # "machine X Y learning" matches
        limit=10
    )
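The effect of slop can be illustrated on a plain token list. This sketch assumes slop is the total number of extra tokens allowed between phrase terms that appear in order; `phrase_matches` is a hypothetical helper, and real engines may define slop differently (some also allow reordering).

```python
def phrase_matches(tokens, phrase, slop=0):
    """Return True if the phrase terms occur in order with at most `slop` extra tokens in between."""
    if not phrase:
        return False

    def search(term_index, prev_pos, budget):
        if term_index == len(phrase):
            return True
        # The next term may sit anywhere within the remaining slop budget.
        for gap in range(budget + 1):
            pos = prev_pos + 1 + gap
            if pos < len(tokens) and tokens[pos] == phrase[term_index]:
                if search(term_index + 1, pos, budget - gap):
                    return True
        return False

    # Try every occurrence of the first term as a starting point.
    return any(
        search(1, start, slop)
        for start, token in enumerate(tokens)
        if token == phrase[0]
    )


text = "machine based statistical learning".split()
print(phrase_matches(text, ["machine", "learning"], slop=0))  # False
print(phrase_matches(text, ["machine", "learning"], slop=2))  # True
```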

Sorted Hybrid Search¶

Sort hybrid results by custom criteria:

with ix.searcher() as searcher:
    # Sort by date (newest first), then by relevance score
    results = searcher.search_sorted(
        "python programming",
        sort_by=[
            ("published", True),   # Descending
            ("score", True)        # Descending
        ],
        limit=20,
        alpha=0.5
    )

    for r in results:
        doc = r.stored_fields
        print(f"{doc['title']} - {doc['published']}")
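Multi-key ordering like the `sort_by` list above can be reproduced with a stable sort. The following is a sketch under the assumption that each tuple means (field, descending) and that earlier tuples take priority; `sort_hits` is an illustrative helper working on plain dicts, not the library's code.

```python
def sort_hits(hits, sort_by):
    """Sort dicts by several (field, descending) keys, first key most significant."""
    ordered = list(hits)
    # Python's sort is stable, so apply the least significant key first.
    for field, descending in reversed(sort_by):
        ordered.sort(key=lambda hit: hit[field], reverse=descending)
    return ordered


hits = [
    {"title": "A", "published": "2024-01-15", "score": 0.71},
    {"title": "B", "published": "2024-02-20", "score": 0.42},
    {"title": "C", "published": "2024-02-20", "score": 0.88},
]

ordered = sort_hits(hits, [("published", True), ("score", True)])
print([hit["title"] for hit in ordered])  # ['C', 'B', 'A']
```

Documents B and C share a publication date, so the second key (score, descending) decides their order.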

Configuration¶

Embedding Provider¶

Choose an embedding model based on your needs:

from semlix.semantic import SentenceTransformerProvider

# Fast and lightweight (384-dim)
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

# Better quality (768-dim)
embedder = SentenceTransformerProvider("all-mpnet-base-v2")

# Multilingual
embedder = SentenceTransformerProvider("paraphrase-multilingual-MiniLM-L12-v2")

Vector Store Configuration¶

Configure PostgreSQL vector storage:

from semlix.semantic.stores import PgVectorStore

vector_store = PgVectorStore(
    connection_string="postgresql://localhost/mydb",
    dimension=384,
    distance_metric="cosine",  # or "l2", "inner_product"
    pool_size=10
)

# Create HNSW index for fast search
vector_store.create_index(
    index_type="hnsw",
    m=16,              # HNSW parameter
    ef_construction=64 # HNSW parameter
)
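The three `distance_metric` options rank neighbors differently. As a small pure-Python illustration of what each one measures (the function names are mine; the operator names in the docstrings are pgvector's):

```python
import math


def l2_distance(a, b):
    """Euclidean distance (pgvector operator: <->)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Dot product (pgvector's <#> operator returns its negative)."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity (pgvector operator: <=>)."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [3.0, 0.0]   # longer vector, same direction
orthogonal = [0.0, 1.0]

print(cosine_distance(query, same_direction))  # 0.0: magnitude is ignored
print(l2_distance(query, same_direction))      # 2.0: magnitude matters
print(cosine_distance(query, orthogonal))      # 1.0
```

Cosine is the usual choice for sentence-embedding models because it compares direction only; for vectors that are already unit-normalized, all three metrics produce the same ranking.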

Searchable Fields¶

Control which fields are used for embeddings:

ix = create_unified_index(
    index_dir="my_index",
    schema=schema,
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    searchable_fields=["title", "content"]  # Only these fields
)

By default, all TEXT fields are used for embedding generation.
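One plausible way to turn several searchable fields into a single embedding input is to concatenate them. This is only an assumption about the approach, not a description of what semlix does internally; `text_for_embedding` is a hypothetical helper.

```python
def text_for_embedding(doc, searchable_fields, separator="\n"):
    """Join the non-empty values of the chosen fields into one string to embed."""
    parts = [str(doc[name]).strip() for name in searchable_fields if doc.get(name)]
    return separator.join(parts)


doc = {
    "id": "1",
    "title": "Introduction to Machine Learning",
    "content": "Machine learning enables systems to learn from data...",
    "author": "Alice",
}

# Only title and content feed the embedding; author is left out.
print(text_for_embedding(doc, ["title", "content"]))
```

Restricting the list to fields that carry meaning (titles, body text) keeps identifiers and short keyword values from diluting the vector.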

Fusion Methods¶

Choose how to combine lexical and semantic scores:

from semlix.semantic.fusion import FusionMethod

with ix.searcher() as searcher:
    results = searcher.hybrid_search(
        "query",
        fusion_method=FusionMethod.RRF,  # Reciprocal Rank Fusion
        alpha=0.5
    )

Available Methods:

RRF (Reciprocal Rank Fusion): Recommended, parameter-free
LINEAR: Weighted linear combination
DBSF (Distribution-Based Score Fusion): Normalizes score distributions
RELATIVE_SCORE: Relative scoring normalization
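For intuition, Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the ranked lists it appears in. The sketch below uses the conventional constant k=60; whether semlix uses that value, or how it applies alpha alongside RRF, is not stated here, so treat `reciprocal_rank_fusion` as an illustration only.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc IDs: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by doc ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


lexical_ranking = ["a", "b", "c"]
semantic_ranking = ["b", "d", "a"]

fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
print([doc_id for doc_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Because only ranks are used, RRF needs no score normalization, which makes it robust when BM25 and vector scores are on very different scales.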

Migration¶

From FileStorage + NumpyVectorStore¶

Migrate existing indexes to UnifiedIndex:

from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider

embedder = SentenceTransformerProvider()

migrate_to_unified(
    source_dir="old_whoosh_index",
    target_dir="new_unified_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="old_vectors.pkl",  # Reuse existing vectors
    batch_size=100
)

Migration Process:

Opens source index and vector store
Creates new UnifiedIndex
Migrates documents with embeddings
Reuses existing vectors when available
Generates new vectors for missing documents
Optimizes both indexes
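The reuse-or-generate step in the process above can be sketched as follows. It assumes existing vectors are available as a dict keyed by document ID and that the embedder accepts a batch of texts; `collect_embeddings` and `fake_embed` are hypothetical names used only for this illustration.

```python
def collect_embeddings(doc_ids, texts, existing_vectors, embed_batch):
    """Reuse stored vectors where present and embed only the documents that lack one."""
    missing = [i for i, doc_id in enumerate(doc_ids) if doc_id not in existing_vectors]
    # One batched call for all missing documents instead of one call per document.
    new_vectors = embed_batch([texts[i] for i in missing]) if missing else []
    generated = {doc_ids[i]: vec for i, vec in zip(missing, new_vectors)}
    return [existing_vectors.get(doc_id, generated.get(doc_id)) for doc_id in doc_ids]


def fake_embed(texts):
    """Toy embedder for the demo: one-dimensional vector holding the text length."""
    return [[float(len(t))] for t in texts]


old_vectors = {"1": [0.5]}
vectors = collect_embeddings(["1", "2"], ["alpha", "beta"], old_vectors, fake_embed)
print(vectors)  # [[0.5], [4.0]]
```

Reused vectors are only valid if they came from the same embedding model and dimension as the new index.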

From BM25Index¶

Add vector search to an existing BM25 index:

from semlix.bm25 import open_bm25_index
from semlix.unified import UnifiedIndex
from semlix.semantic import SentenceTransformerProvider
from semlix.semantic.stores import PgVectorStore

# Open existing BM25 index
bm25_ix = open_bm25_index("my_bm25_index")

# Create vector store
embedder = SentenceTransformerProvider()
vector_store = PgVectorStore(
    "postgresql://localhost/mydb",
    dimension=embedder.dimension
)

# Generate embeddings for existing documents
docs = []
with bm25_ix.reader() as reader:
    for doc in reader.iter_docs():
        docs.append(doc)

# Extract text and generate embeddings
texts = [doc.get("content", "") for doc in docs]
doc_ids = [doc.get("id", str(i)) for i, doc in enumerate(docs)]
embeddings = embedder.encode(texts)

# Add to vector store
vector_store.add(doc_ids, embeddings)

# Create unified index
unified_ix = UnifiedIndex(
    index_dir="unified_index",
    schema=bm25_ix.schema,
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    bm25_index=bm25_ix,
    vector_store=vector_store
)

Performance¶

Search Performance¶

Hybrid Search:

500+ queries/second (10K documents)
~5-10ms latency (p50)
Scales well with document count

Lexical-Only:

1000+ queries/second
~1-2ms latency

Semantic-Only:

~100 queries/second (with HNSW index)
~10-20ms latency

Indexing Performance¶

With Embedding Generation:

~100 documents/second
Depends on embedding model speed
Can batch for better throughput

Optimization:

Use batch processing for bulk indexing:

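batch_size = 100
batch = []

with ix.writer() as writer:
    for doc in documents:
        batch.append(doc)

        if len(batch) >= batch_size:
            for doc_fields in batch:
                writer.add_document(**doc_fields)
            batch = []

    # Flush the final partial batch so trailing documents are not dropped
    for doc_fields in batch:
        writer.add_document(**doc_fields)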
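The batching pattern can also be factored into a small generator that always yields the final partial batch. This is a generic sketch using only the standard library; `batched` is an illustrative helper and `documents` here is made-up sample data.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to `batch_size` items, including the final partial batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


documents = [{"id": str(i), "content": f"doc {i}"} for i in range(250)]
print([len(batch) for batch in batched(documents, 100)])  # [100, 100, 50]
```

Each yielded batch could then be passed to the writer inside a loop, which keeps the leftover-handling logic in one place.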

Memory Usage¶

Component	10K docs	100K docs
BM25 Index	100MB	500MB
Vector Store (PG)	40MB	400MB
Total (approx)	140MB	900MB
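As a sanity check on the vector figures, the raw embedding payload can be estimated from the document count and dimension. The sketch assumes 4-byte floats, which is how pgvector's `vector` type stores each element; the table's larger numbers presumably also include row overhead, indexes, and stored metadata. `raw_vector_megabytes` is an illustrative helper.

```python
def raw_vector_megabytes(num_docs, dimension, bytes_per_value=4):
    """Size of the raw embedding values alone, in MB (10**6 bytes)."""
    return num_docs * dimension * bytes_per_value / 1_000_000


print(raw_vector_megabytes(10_000, 384))    # 15.36
print(raw_vector_megabytes(100_000, 384))   # 153.6
print(raw_vector_megabytes(100_000, 768))   # 307.2
```

Doubling the embedding dimension (for example, moving from a 384-dim to a 768-dim model) doubles this baseline.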

Disk Usage¶

Component	10K docs	100K docs
BM25 Index	50MB	250MB
PostgreSQL (total)	100MB	800MB
Total (approx)	150MB	1050MB

Examples¶

Basic Hybrid Search¶

from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID
from semlix.semantic import SentenceTransformerProvider

schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
embedder = SentenceTransformerProvider()

ix = create_unified_index(
    "my_index",
    schema,
    "postgresql://localhost/mydb",
    embedder
)

# Index
with ix.writer() as writer:
    writer.add_document(
        id="1",
        content="Python is a programming language"
    )
    writer.add_document(
        id="2",
        content="Machine learning uses neural networks"
    )

# Search
with ix.searcher() as searcher:
    # Hybrid: finds both keyword and semantic matches
    results = searcher.hybrid_search("coding in python", limit=10)

Complete Example with All Features¶

from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME
from semlix.semantic import SentenceTransformerProvider

schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    category=KEYWORD(stored=True),
    published=DATETIME(stored=True)
)

embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

ix = create_unified_index(
    "my_index",
    schema,
    "postgresql://localhost/mydb",
    embedder
)

# Index documents
with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="AI Basics",
        content="Introduction to artificial intelligence...",
        category="ai",
        published="2024-01-15"
    )
    # ... more documents ...

# Search with all features
with ix.searcher() as searcher:
    # Hybrid search with facets
    results, facets = searcher.search_with_facets(
        "artificial intelligence",
        facet_fields=["category"],
        limit=50,
        alpha=0.5
    )

    # Sort by date
    sorted_results = searcher.sort_results(
        results,
        [("published", True)]
    )

    # Phrase search
    phrase_results = searcher.phrase_search(
        "content",
        "machine learning"
    )

Best Practices¶

Choose appropriate alpha:
- Use alpha=0.3-0.5 for balanced search
- Use alpha=0.0 for exact keyword matching
- Use alpha=0.8-1.0 for conceptual/semantic queries
Batch indexing for performance:
- Index in batches of 100-1000 documents
- Commit once per batch, not per document
Create HNSW index for vectors:
- Essential for good semantic search performance
- Create after bulk indexing:
```
ix.optimize()  # Optimizes both BM25 and vector indexes
```
Choose embedding model wisely:
- Start with all-MiniLM-L6-v2 (fast, good quality)
- Upgrade to all-mpnet-base-v2 if quality matters more than speed
- Use multilingual models only if needed
Monitor PostgreSQL:
- Regular VACUUM ANALYZE
- Monitor connection pool usage
- Consider replication for high availability

Troubleshooting¶

Slow Semantic Search¶

Problem: Vector search is slow (>100ms per query)

Solutions:

Create HNSW index:

ix.vector_store.create_index(index_type="hnsw")

Tune HNSW parameters:

ix.vector_store.create_index(
    index_type="hnsw",
    m=32,              # Higher = better quality, slower build
    ef_construction=128 # Higher = better quality, slower build
)

Memory Issues¶

Problem: High memory usage during indexing

Solutions:

Use smaller batches

Enable memory mapping for BM25:

from semlix.stores import BM25sStore
store = BM25sStore.load("my_unified_index", mmap=True)  # path to your index directory

Reduce connection pool size

Connection Pool Exhausted¶

Problem: PostgreSQL connection errors

Solutions:

Increase pool size:

from semlix.semantic.stores import PgVectorStore

vector_store = PgVectorStore(
    connection_string="postgresql://localhost/mydb",
    pool_size=50  # Increase from default 10
)

Close searchers when done
Use context managers (with statements)
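Why context managers help here: the release step runs even when the body raises, so a failed query cannot leak a connection. The toy pool below is a hypothetical illustration built on the standard library, not semlix's or psycopg2's pool.

```python
import threading
from contextlib import contextmanager


class TinyPool:
    """Toy pool that hands out at most `pool_size` slots (hypothetical, not semlix's pool)."""

    def __init__(self, pool_size):
        self._slots = threading.BoundedSemaphore(pool_size)
        self.in_use = 0

    @contextmanager
    def connection(self):
        if not self._slots.acquire(timeout=1):
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        try:
            yield object()  # stand-in for a real connection
        finally:
            # Runs even if the body raises, so the slot is always returned.
            self.in_use -= 1
            self._slots.release()


pool = TinyPool(pool_size=1)
try:
    with pool.connection():
        raise ValueError("query failed")
except ValueError:
    pass

print(pool.in_use)  # 0: the slot was released despite the error
```

Without the `with` block, an exception between acquiring and releasing would leave the slot held, and repeated failures would eventually exhaust the pool.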

See Also¶