Unified Index

UnifiedIndex combines BM25 lexical search with pgvector semantic search behind a single interface, so one query can use fast keyword matching, embedding-based similarity, or both.

Overview

UnifiedIndex automatically manages both a BM25 index for lexical search and a PostgreSQL vector store for semantic search. Documents are indexed in both stores simultaneously, and searches can leverage either or both approaches.

Key Benefits:

  • Hybrid search out-of-the-box: No manual setup required

  • Transactional writes: Atomic updates across both stores

  • Automatic embeddings: Generates vectors during indexing

  • Enhanced features: Faceting, sorting, and phrase queries on hybrid results

  • Production-ready: ACID transactions, scalable PostgreSQL backend

Quick Start

Creating a Unified Index

from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME
from semlix.semantic import SentenceTransformerProvider
from semlix.analysis import StandardAnalyzer

# Define schema
schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True, analyzer=StandardAnalyzer()),
    author=KEYWORD(stored=True),
    category=KEYWORD(stored=True),
    published=DATETIME(stored=True)
)

# Create embedding provider
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

# Create unified index
ix = create_unified_index(
    index_dir="my_unified_index",
    schema=schema,
    connection_string="postgresql://localhost/mydb",
    embedder=embedder
)

Prerequisites

UnifiedIndex requires:

  1. PostgreSQL with pgvector extension:

    -- Enable the pgvector extension (run in the target database)
    CREATE EXTENSION vector;
    
  2. Python packages:

    pip install bm25s sentence-transformers psycopg2-binary pgvector
    

Indexing Documents

Use the unified writer to add documents to both indexes:

with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="Introduction to Machine Learning",
        content="Machine learning enables systems to learn from data...",
        author="Alice",
        category="ai",
        published="2024-01-15"
    )
    writer.add_document(
        id="2",
        title="Python Programming Guide",
        content="Learn Python programming best practices...",
        author="Bob",
        category="programming",
        published="2024-02-20"
    )

The writer automatically:

  1. Indexes documents in the BM25 index

  2. Generates embeddings for specified fields

  3. Stores vectors in PostgreSQL

  4. Commits both atomically

Searching

Components

UnifiedIndex

The main index class combining BM25 and vector search.

Constructor Parameters:

  • index_dir: Directory for the index

  • schema: Field schema

  • connection_string: PostgreSQL connection URL

  • embedder: Embedding provider

  • id_field: Field containing document IDs (default: "id")

  • searchable_fields: Fields to use for embeddings (default: all TEXT fields)

Methods:

  • writer(**kwargs): Returns UnifiedWriter

  • searcher(**kwargs): Returns UnifiedSearcher

  • reader(**kwargs): Returns BM25Reader

  • optimize(): Optimizes both indexes

  • doc_count(): Returns document count

  • close(): Closes both stores

UnifiedWriter

Handles transactional writes across both stores:

with ix.writer() as writer:
    # Add document (indexed in both BM25 and vectors)
    writer.add_document(id="1", content="Document text")

    # Update document (deletes old, adds new in both stores)
    writer.update_document(id="1", content="Updated text")

    # Delete document (removes from both stores)
    writer.delete_document(id="1")

    # Delete by query
    from semlix.qparser import QueryParser
    qp = QueryParser("content", ix.schema)
    query = qp.parse("obsolete")
    writer.delete_by_query(query)

Transaction Guarantees:

  • Writes are atomic across both stores

  • If vector storage fails, BM25 changes roll back

  • Automatic embedding generation

  • Configurable batch processing
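
The rollback guarantee above can be pictured with two in-memory stand-ins. This is only a sketch of the pattern, not how semlix implements it: `write_both`, `lexical`, `vectors`, and `failing_embed` are hypothetical names, and plain dicts take the place of the BM25 index and the PostgreSQL vector store.

```python
def write_both(lexical, vectors, doc_id, text, embed):
    """Add a document to both stores, undoing the first write if the second fails.

    `lexical` and `vectors` are plain dicts standing in for the BM25 index
    and the vector store (hypothetical, for illustration only).
    """
    had_previous = doc_id in lexical
    previous = lexical.get(doc_id)
    lexical[doc_id] = text              # step 1: lexical write
    try:
        vectors[doc_id] = embed(text)   # step 2: embed and store the vector
    except Exception:
        # Roll back step 1 so the two stores never disagree.
        if had_previous:
            lexical[doc_id] = previous
        else:
            del lexical[doc_id]
        raise


def failing_embed(text):
    # Simulates the vector store being unreachable.
    raise RuntimeError("vector store unavailable")


lexical, vectors = {}, {}
write_both(lexical, vectors, "1", "hello world", embed=lambda t: [float(len(t))])
try:
    write_both(lexical, vectors, "2", "second doc", embed=failing_embed)
except RuntimeError:
    pass  # the failed write left no trace in either store
```

After the failed second write, both stand-in stores still contain only document "1".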

UnifiedSearcher

Enhanced searcher with hybrid search capabilities:

with ix.searcher() as searcher:
    # Hybrid search
    results = searcher.hybrid_search("query", alpha=0.5)

    # With facets
    results, facets = searcher.search_with_facets(
        "python",
        facet_fields=["category", "author"],
        limit=100
    )

    # Phrase search
    results = searcher.phrase_search(
        "content",
        "machine learning",
        slop=0
    )

    # Sorted search
    results = searcher.search_sorted(
        "python",
        sort_by=[("published", True), ("score", True)],
        limit=10
    )

Methods:

  • hybrid_search(...): Combined lexical + semantic

  • lexical_only(...): BM25 only

  • semantic_only(...): Vector only

  • search_with_facets(...): Hybrid search with aggregations

  • phrase_search(...): Exact phrase matching

  • sort_results(...): Sort existing results

  • search_sorted(...): Search with custom sorting
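
To make "aggregations" concrete, the sketch below counts field values over a list of hits. It assumes hits behave like dictionaries of stored fields; `count_facets` is a hypothetical helper, and the actual structure returned by `search_with_facets` may differ.

```python
from collections import Counter


def count_facets(hits, facet_fields):
    """Count how often each value occurs, per facet field, among the hits."""
    facets = {field: Counter() for field in facet_fields}
    for hit in hits:
        for field in facet_fields:
            value = hit.get(field)
            if value is not None:
                facets[field][value] += 1
    return facets


# Stand-in hits; real hits would come from a searcher.
hits = [
    {"id": "1", "category": "ai", "author": "Alice"},
    {"id": "2", "category": "programming", "author": "Bob"},
    {"id": "3", "category": "ai", "author": "Bob"},
]
facets = count_facets(hits, ["category", "author"])
```

Each facet is a `Counter`, so `facets["category"].most_common()` lists values from most to least frequent.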

Advanced Features

Phrase Queries

Find exact phrases in hybrid results:

with ix.searcher() as searcher:
    # Exact phrase
    results = searcher.phrase_search(
        field="content",
        phrase="machine learning",
        slop=0,
        limit=10
    )

    # With slop (allows words in between)
    results = searcher.phrase_search(
        field="content",
        phrase="machine learning",
        slop=2,  # "machine X Y learning" matches
        limit=10
    )
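
The effect of `slop` can be illustrated without the library. The sketch below follows the comment above (slop=2 lets up to two extra words sit between the phrase terms); `phrase_matches` is a hypothetical helper that splits on whitespace only, and the library's own tokenization and exact slop definition may differ.

```python
def phrase_matches(text, phrase, slop=0):
    """Return True if the phrase terms occur in order with at most `slop`
    extra words between them in total."""
    words = text.lower().split()
    terms = phrase.lower().split()
    if not terms:
        return False

    def search(term_idx, prev_pos, budget):
        if term_idx == len(terms):
            return True
        # The next term may sit up to `budget` words after the adjacent slot.
        for pos in range(prev_pos + 1, min(prev_pos + 2 + budget, len(words))):
            if words[pos] == terms[term_idx]:
                used = pos - prev_pos - 1
                if search(term_idx + 1, pos, budget - used):
                    return True
        return False

    return any(
        search(1, start, slop)
        for start, word in enumerate(words)
        if word == terms[0]
    )


exact = phrase_matches("Machine learning enables systems", "machine learning")
too_far = phrase_matches("machine assisted deep learning", "machine learning", slop=0)
within_slop = phrase_matches("machine assisted deep learning", "machine learning", slop=2)
```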

Configuration

Embedding Provider

Choose an embedding model based on your needs:

from semlix.semantic import SentenceTransformerProvider

# Fast and lightweight (384-dim)
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

# Better quality (768-dim)
embedder = SentenceTransformerProvider("all-mpnet-base-v2")

# Multilingual
embedder = SentenceTransformerProvider("paraphrase-multilingual-MiniLM-L12-v2")

Vector Store Configuration

Configure PostgreSQL vector storage:

from semlix.semantic.stores import PgVectorStore

vector_store = PgVectorStore(
    connection_string="postgresql://localhost/mydb",
    dimension=384,
    distance_metric="cosine",  # or "l2", "inner_product"
    pool_size=10
)

# Create HNSW index for fast search
vector_store.create_index(
    index_type="hnsw",
    m=16,              # HNSW parameter
    ef_construction=64 # HNSW parameter
)
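
For reference, the three `distance_metric` options correspond to the following calculations, written here in plain Python. In pgvector they map to the `<=>` (cosine distance), `<->` (L2 distance), and `<#>` (negative inner product) operators; the function names below are illustrative only.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 - cosine similarity; ignores vector length, compares direction only.
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norm


a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]  # same direction as a, twice the length

same_direction = cosine_distance(a, c)  # direction matches, so distance is 0
orthogonal = cosine_distance(a, b)      # perpendicular vectors
length_gap = l2_distance(a, c)          # L2 does notice the length difference
```

Cosine distance is a common choice for sentence embeddings because it is insensitive to vector length, as the `a` versus `c` comparison shows.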

Searchable Fields

Control which fields are used for embeddings:

ix = create_unified_index(
    index_dir="my_index",
    schema=schema,
    connection_string=pg_url,
    embedder=embedder,
    searchable_fields=["title", "content"]  # Only these fields
)

By default, all TEXT fields are used for embedding generation.

Fusion Methods

Choose how to combine lexical and semantic scores:

from semlix.semantic.fusion import FusionMethod

with ix.searcher() as searcher:
    results = searcher.hybrid_search(
        "query",
        fusion_method=FusionMethod.RRF,  # Reciprocal Rank Fusion
        alpha=0.5
    )

Available Methods:

  • RRF (Reciprocal Rank Fusion): Recommended, parameter-free

  • LINEAR: Weighted linear combination

  • DBSF (Distribution-Based Score Fusion): Normalizes score distributions

  • RELATIVE_SCORE: Relative scoring normalization
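
The two most common methods can be sketched in a few lines. These are textbook formulations, not semlix's internal code: `rrf_fuse` and `linear_fuse` are hypothetical names, `k=60` is the constant conventionally used with RRF (an assumption about this library), and `alpha` is treated as the semantic weight, matching the guidance that alpha=0.0 gives pure keyword matching.

```python
def rrf_fuse(lexical_ranking, semantic_ranking, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in (lexical_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


def linear_fuse(lexical_scores, semantic_scores, alpha=0.5):
    """Weighted sum after min-max normalisation; alpha weights the semantic side."""
    def normalise(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {d: (s - lo) / span if span else 1.0 for d, s in scores.items()}

    lex, sem = normalise(lexical_scores), normalise(semantic_scores)
    fused = {
        d: (1 - alpha) * lex.get(d, 0.0) + alpha * sem.get(d, 0.0)
        for d in set(lex) | set(sem)
    }
    return sorted(fused, key=lambda d: (-fused[d], d))


rrf_order = rrf_fuse(["a", "b", "c"], ["b", "c", "d"])

lexical_scores = {"a": 3.0, "b": 2.0, "c": 1.0}
semantic_scores = {"b": 0.9, "c": 0.8, "d": 0.1}
keyword_only = linear_fuse(lexical_scores, semantic_scores, alpha=0.0)
balanced = linear_fuse(lexical_scores, semantic_scores, alpha=0.5)
semantic_only = linear_fuse(lexical_scores, semantic_scores, alpha=1.0)
```

RRF uses only ranks, so it needs no score normalisation; the linear method depends on how the raw scores are scaled, which is why it normalises first.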

Migration

From FileStorage + NumpyVectorStore

Migrate existing indexes to UnifiedIndex:

from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider

embedder = SentenceTransformerProvider()

migrate_to_unified(
    source_dir="old_whoosh_index",
    target_dir="new_unified_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="old_vectors.pkl",  # Reuse existing vectors
    batch_size=100
)

Migration Process:

  1. Opens source index and vector store

  2. Creates new UnifiedIndex

  3. Migrates documents with embeddings

  4. Reuses existing vectors when available

  5. Generates new vectors for missing documents

  6. Optimizes both indexes
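
Steps 4 and 5 amount to a lookup followed by a batch embedding call for whatever is missing. The sketch below shows that logic in isolation; `plan_embeddings` and `fake_embed` are hypothetical, and the sketch assumes existing vectors are keyed by document ID.

```python
def plan_embeddings(docs, existing_vectors, embed, id_field="id", text_field="content"):
    """Reuse stored vectors and embed only the documents that lack one.

    Returns (vectors_by_id, ids_that_were_embedded).
    """
    vectors = {}
    missing = []
    for doc in docs:
        doc_id = doc[id_field]
        if doc_id in existing_vectors:
            vectors[doc_id] = existing_vectors[doc_id]
        else:
            missing.append(doc)
    if missing:
        # One batched call for every document without a stored vector.
        new_vectors = embed([doc.get(text_field, "") for doc in missing])
        for doc, vector in zip(missing, new_vectors):
            vectors[doc[id_field]] = vector
    return vectors, [doc[id_field] for doc in missing]


calls = []


def fake_embed(texts):
    # Stand-in embedder that records what it was asked to encode.
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


docs = [{"id": "1", "content": "old doc"}, {"id": "2", "content": "new doc!"}]
existing = {"1": [0.25]}
vectors, embedded_ids = plan_embeddings(docs, existing, fake_embed)
```

Only document "2" reaches the embedder, which is where most of the migration time would otherwise go.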

From BM25Index

Add vector search to an existing BM25 index:

from semlix.bm25 import open_bm25_index
from semlix.unified import UnifiedIndex
from semlix.semantic import SentenceTransformerProvider
from semlix.semantic.stores import PgVectorStore

# Open existing BM25 index
bm25_ix = open_bm25_index("my_bm25_index")

# Create vector store
embedder = SentenceTransformerProvider()
vector_store = PgVectorStore(
    "postgresql://localhost/mydb",
    dimension=embedder.dimension
)

# Collect the existing documents
docs = []
with bm25_ix.reader() as reader:
    for doc in reader.iter_docs():
        docs.append(doc)

# Extract text and generate embeddings
texts = [doc.get("content", "") for doc in docs]
doc_ids = [doc.get("id", str(i)) for i, doc in enumerate(docs)]
embeddings = embedder.encode(texts)

# Add to vector store
vector_store.add(doc_ids, embeddings)

# Create unified index
unified_ix = UnifiedIndex(
    index_dir="unified_index",
    schema=bm25_ix.schema,
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    bm25_index=bm25_ix,
    vector_store=vector_store
)

Performance

Search Performance

Hybrid Search:

  • 500+ queries/second (10K documents)

  • ~5-10ms latency (p50)

  • Scales well with document count

Lexical-Only:

  • 1000+ queries/second

  • ~1-2ms latency

Semantic-Only:

  • ~100 queries/second (with HNSW index)

  • ~10-20ms latency
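
Figures like these depend on hardware, model, and data, so it may be worth measuring on your own workload. The helper below is a minimal timing sketch; `measure_latency_ms` and `fake_search` are hypothetical, and any callable that takes a query string can be passed in.

```python
import statistics
import time


def measure_latency_ms(search, queries, repeats=5):
    """Run each query `repeats` times and return (p50, p95) latency in milliseconds."""
    samples = []
    for _ in range(repeats):
        for query in queries:
            start = time.perf_counter()
            search(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p50 = statistics.median(samples)
    # Nearest-rank style p95 on the sorted samples.
    p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
    return p50, p95


def fake_search(query):
    # Stand-in workload; replace with a real search call.
    return sorted(query)


p50, p95 = measure_latency_ms(fake_search, ["machine learning", "python"])
```

With a real index, the callable could be something like `lambda q: searcher.hybrid_search(q, alpha=0.5)`.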

Indexing Performance

With Embedding Generation:

  • ~100 documents/second

  • Depends on embedding model speed

  • Can batch for better throughput

Optimization:

Use batch processing for bulk indexing:

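batch_size = 100
batch = []

def flush(batch):
    # One writer per batch, so each batch is committed once
    with ix.writer() as writer:
        for doc_fields in batch:
            writer.add_document(**doc_fields)

for doc in documents:
    batch.append(doc)
    if len(batch) >= batch_size:
        flush(batch)
        batch = []

# Write the final, partially filled batch
if batch:
    flush(batch)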

Memory Usage

Component            10K docs    100K docs
BM25 Index           100MB       500MB
Vector Store (PG)    40MB        400MB
Total (approx)       140MB       900MB
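
As a sanity check on the vector store figures, the raw size of the embeddings alone can be estimated from the document count and dimension, assuming 4-byte float32 values. `raw_vector_megabytes` is a hypothetical helper; it gives a lower bound, since row and index overhead are not included.

```python
def raw_vector_megabytes(num_docs, dimension, bytes_per_value=4):
    """Raw size of float32 embeddings in MB, excluding row and index overhead."""
    return num_docs * dimension * bytes_per_value / 1_000_000


small = raw_vector_megabytes(10_000, 384)    # 384-dim model, 10K docs
large = raw_vector_megabytes(100_000, 384)   # 384-dim model, 100K docs
mpnet = raw_vector_megabytes(100_000, 768)   # 768-dim model, 100K docs
```

For 384-dimensional vectors this gives about 15 MB at 10K documents and about 154 MB at 100K, and doubling the dimension doubles the raw size.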

Disk Usage

Component             10K docs    100K docs
BM25 Index            50MB        250MB
PostgreSQL (total)    100MB       800MB
Total (approx)        150MB       1050MB

Examples

Complete Example with All Features

from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME
from semlix.semantic import SentenceTransformerProvider

schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    category=KEYWORD(stored=True),
    published=DATETIME(stored=True)
)

embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")

ix = create_unified_index(
    "my_index",
    schema,
    "postgresql://localhost/mydb",
    embedder
)

# Index documents
with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="AI Basics",
        content="Introduction to artificial intelligence...",
        category="ai",
        published="2024-01-15"
    )
    # ... more documents ...

# Search with all features
with ix.searcher() as searcher:
    # Hybrid search with facets
    results, facets = searcher.search_with_facets(
        "artificial intelligence",
        facet_fields=["category"],
        limit=50,
        alpha=0.5
    )

    # Sort by date
    sorted_results = searcher.sort_results(
        results,
        [("published", True)]
    )

    # Phrase search
    phrase_results = searcher.phrase_search(
        "content",
        "machine learning"
    )
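
The sort specification is a list of (field, flag) pairs. The sketch below shows one way such a list could be applied, assuming the boolean means "descending" (an assumption, since the flag's meaning is not stated here) and that hits behave like dictionaries; `sort_hits` is a hypothetical helper.

```python
def sort_hits(hits, sort_by):
    """Sort by several (field, descending) pairs, first pair most significant."""
    ordered = list(hits)
    # Python's sort is stable, so apply the least significant key first.
    for field, descending in reversed(sort_by):
        ordered.sort(key=lambda hit: hit[field], reverse=descending)
    return ordered


# Stand-in hits; ISO date strings sort chronologically as plain text.
hits = [
    {"id": "1", "published": "2024-01-15", "score": 0.7},
    {"id": "2", "published": "2024-02-20", "score": 0.4},
    {"id": "3", "published": "2024-02-20", "score": 0.9},
]
ordered = sort_hits(hits, [("published", True), ("score", True)])
ordered_ids = [hit["id"] for hit in ordered]
```

Documents "3" and "2" share the newest date, so the second key (score) decides their order, and "1" comes last.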

Best Practices

  1. Choose appropriate alpha:

    • Use alpha=0.3-0.5 for balanced search

    • Use alpha=0.0 for exact keyword matching

    • Use alpha=0.8-1.0 for conceptual/semantic queries

  2. Batch indexing for performance:

    • Index in batches of 100-1000 documents

    • Commit once per batch, not per document

  3. Create HNSW index for vectors:

    • Essential for good semantic search performance

    • Create after bulk indexing:

      ix.optimize()  # Optimizes both BM25 and vector indexes
      
  4. Choose embedding model wisely:

    • Start with all-MiniLM-L6-v2 (fast, good quality)

    • Upgrade to all-mpnet-base-v2 if quality matters more than speed

    • Use multilingual models only if needed

  5. Monitor PostgreSQL:

    • Regular VACUUM ANALYZE

    • Monitor connection pool usage

    • Consider replication for high availability

Troubleshooting

Memory Issues

Problem: High memory usage during indexing

Solutions:

  1. Use smaller batches

  2. Enable memory mapping for BM25:

    from semlix.stores import BM25sStore
    store = BM25sStore.load(index_dir, mmap=True)
    
  3. Reduce connection pool size

Connection Pool Exhausted

Problem: PostgreSQL connection errors

Solutions:

  1. Increase pool size:

    vector_store = PgVectorStore(
        connection_string=pg_url,
        pool_size=50  # Increase from default 10
    )
    
  2. Close searchers when done

  3. Use context managers (with statements)
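
Why context managers help can be shown with a toy pool. This is not the pool used by `PgVectorStore`; `TinyPool` is a hypothetical stand-in that only counts checked-out connections.

```python
from contextlib import contextmanager


class TinyPool:
    """Toy pool that tracks how many connections are checked out."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    @contextmanager
    def connection(self):
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        try:
            yield object()  # placeholder for a real connection
        finally:
            self.in_use -= 1  # always returned, even if the body raises


pool = TinyPool(size=1)

# Two sequential uses succeed because each `with` returns its connection.
with pool.connection():
    pass
with pool.connection():
    pass

# A connection that is acquired but never released blocks the next caller.
leaked = pool.connection()
leaked.__enter__()
try:
    with pool.connection():
        pass
except RuntimeError as exc:
    error = str(exc)
```

The same reasoning applies to searchers and writers: leaving the `with` block is what hands the underlying connection back.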

See Also