Unified Index
UnifiedIndex combines high-performance BM25 lexical search with pgvector semantic search in a single, unified interface. This provides the best of both worlds: fast keyword matching and semantic understanding.
Overview
UnifiedIndex automatically manages both a BM25 index for lexical search and a PostgreSQL vector store for semantic search. Documents are indexed in both stores simultaneously, and searches can leverage either or both approaches.
Key Benefits:
Hybrid search out-of-the-box: No manual setup required
Transactional writes: Atomic updates across both stores
Automatic embeddings: Generates vectors during indexing
Enhanced features: Faceting, sorting, and phrase queries on hybrid results
Production-ready: ACID transactions, scalable PostgreSQL backend
Quick Start
Creating a Unified Index
from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME
from semlix.semantic import SentenceTransformerProvider
from semlix.analysis import StandardAnalyzer
# Define schema
schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True, analyzer=StandardAnalyzer()),
    content=TEXT(stored=True, analyzer=StandardAnalyzer()),
    author=KEYWORD(stored=True),
    category=KEYWORD(stored=True),
    published=DATETIME(stored=True)
)
# Create embedding provider
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
# Create unified index
ix = create_unified_index(
    index_dir="my_unified_index",
    schema=schema,
    connection_string="postgresql://localhost/mydb",
    embedder=embedder
)
Prerequisites
UnifiedIndex requires:
PostgreSQL with pgvector extension:
-- Enable the pgvector extension (run once per database)
CREATE EXTENSION IF NOT EXISTS vector;
Python packages:
pip install bm25s sentence-transformers psycopg2-binary pgvector
Indexing Documents
Use the unified writer to add documents to both indexes:
with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="Introduction to Machine Learning",
        content="Machine learning enables systems to learn from data...",
        author="Alice",
        category="ai",
        published="2024-01-15"
    )
    writer.add_document(
        id="2",
        title="Python Programming Guide",
        content="Learn Python programming best practices...",
        author="Bob",
        category="programming",
        published="2024-02-20"
    )
The writer automatically:
Indexes documents in the BM25 index
Generates embeddings for specified fields
Stores vectors in PostgreSQL
Commits both atomically
Searching
Hybrid Search
Combine lexical and semantic search (default):
with ix.searcher() as searcher:
    results = searcher.hybrid_search(
        "machine learning algorithms",
        limit=10,
        alpha=0.5  # 0=all lexical, 1=all semantic
    )
    for r in results:
        print(f"{r.stored_fields['title']}")
        print(f" Combined: {r.score:.3f}")
        print(f" Lexical: {r.lexical_score:.3f}")
        print(f" Semantic: {r.semantic_score:.3f}")
Alpha Parameter:
alpha=0.0: Pure lexical search (BM25)
alpha=0.5: Balanced hybrid (recommended)
alpha=1.0: Pure semantic search (vector)
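To make the role of alpha concrete, here is a minimal sketch of a weighted linear blend. It assumes both scores are already normalized to the range 0 to 1. The function `blend_scores` is hypothetical and not part of semlix; how the library actually combines scores depends on the selected fusion method.

```python
def blend_scores(lexical: float, semantic: float, alpha: float = 0.5) -> float:
    """Weighted blend: alpha=0 keeps only the lexical score, alpha=1 only the semantic score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return (1.0 - alpha) * lexical + alpha * semantic


# Hypothetical normalized scores for a single document
print(blend_scores(0.8, 0.2, alpha=0.0))  # lexical score only
print(blend_scores(0.8, 0.2, alpha=0.5))  # equal weight to both
print(blend_scores(0.8, 0.2, alpha=1.0))  # semantic score only
```

A document that scores well on keywords but poorly on meaning drops in rank as alpha moves toward 1.0, which is the trade-off the three settings above describe.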
Lexical-Only Search
Use BM25 only for exact keyword matching:
with ix.searcher() as searcher:
    results = searcher.lexical_only("python programming", limit=10)
Semantic-Only Search
Use vectors only for conceptual queries:
with ix.searcher() as searcher:
    # Finds conceptually similar docs even without keyword overlap
    results = searcher.semantic_only("AI and neural networks", limit=10)
Components
UnifiedIndex
The main index class combining BM25 and vector search.
Constructor Parameters:
index_dir: Directory for the index
schema: Field schema
connection_string: PostgreSQL connection URL
embedder: Embedding provider
id_field: Field containing document IDs (default: "id")
searchable_fields: Fields to use for embeddings (default: all TEXT fields)
Methods:
writer(**kwargs): Returns UnifiedWriter
searcher(**kwargs): Returns UnifiedSearcher
reader(**kwargs): Returns BM25Reader
optimize(): Optimizes both indexes
doc_count(): Returns document count
close(): Closes both stores
UnifiedWriter
Handles transactional writes across both stores:
with ix.writer() as writer:
    # Add document (indexed in both BM25 and vectors)
    writer.add_document(id="1", content="Document text")
    # Update document (deletes old, adds new in both stores)
    writer.update_document(id="1", content="Updated text")
    # Delete document (removes from both stores)
    writer.delete_document(id="1")
    # Delete by query
    from semlix.qparser import QueryParser
    qp = QueryParser("content", ix.schema)
    query = qp.parse("obsolete")
    writer.delete_by_query(query)
Transaction Guarantees:
Writes are atomic across both stores
If vector storage fails, BM25 changes roll back
Automatic embedding generation
Configurable batch processing
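The rollback guarantee listed above can be illustrated with a toy model. This is only a sketch with in-memory dictionaries standing in for the two stores; `InMemoryStore` and `atomic_add` are hypothetical names, and the real UnifiedWriter may coordinate its stores differently.

```python
class InMemoryStore:
    """Toy stand-in for one of the two stores (hypothetical, not a semlix class)."""

    def __init__(self, fail_on=None):
        self.data = {}
        self.fail_on = fail_on  # document ID that triggers a simulated failure

    def put(self, doc_id, value):
        if doc_id == self.fail_on:
            raise RuntimeError(f"simulated failure for {doc_id}")
        self.data[doc_id] = value


def atomic_add(lexical, vectors, doc_id, text, embedding):
    """Write to both stores, undoing the lexical write if the vector write fails."""
    had_previous = doc_id in lexical.data
    previous = lexical.data.get(doc_id)
    lexical.put(doc_id, text)
    try:
        vectors.put(doc_id, embedding)
    except Exception:
        # Roll back the lexical change so both stores stay consistent
        if had_previous:
            lexical.data[doc_id] = previous
        else:
            del lexical.data[doc_id]
        raise


lexical = InMemoryStore()
vectors = InMemoryStore(fail_on="2")  # the vector write for document "2" will fail
atomic_add(lexical, vectors, "1", "Document text", [0.1, 0.2])
try:
    atomic_add(lexical, vectors, "2", "Other text", [0.3, 0.4])
except RuntimeError:
    pass
print(sorted(lexical.data))  # document "2" was rolled back
```

After the failed write, neither store contains document "2", so a later search cannot return a lexical hit that has no vector.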
UnifiedSearcher
Enhanced searcher with hybrid search capabilities:
with ix.searcher() as searcher:
    # Hybrid search
    results = searcher.hybrid_search("query", alpha=0.5)
    # With facets
    results, facets = searcher.search_with_facets(
        "python",
        facet_fields=["category", "author"],
        limit=100
    )
    # Phrase search
    results = searcher.phrase_search(
        "content",
        "machine learning",
        slop=0
    )
    # Sorted search
    results = searcher.search_sorted(
        "python",
        sort_by=[("published", True), ("score", True)],
        limit=10
    )
Methods:
hybrid_search(...): Combined lexical + semantic
lexical_only(...): BM25 only
semantic_only(...): Vector only
search_with_facets(...): Hybrid search with aggregations
phrase_search(...): Exact phrase matching
sort_results(...): Sort existing results
search_sorted(...): Search with custom sorting
Advanced Features
Faceted Hybrid Search
Combine hybrid search with faceting:
with ix.searcher() as searcher:
    # Facet on fields that exist in the schema
    results, facets = searcher.search_with_facets(
        "machine learning",
        facet_fields=["category", "author"],
        limit=100,
        facet_limit=10,
        alpha=0.5
    )
    # Access results
    for r in results[:10]:
        print(r.stored_fields['title'])
    # Access facets
    print("Categories:", facets["category"])
    # {"ai": 45, "programming": 32, "database": 12}
    print("Authors:", facets["author"])
    # {"Alice": 23, "Bob": 18, "Charlie": 15}
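The facet dictionaries shown above are value counts over the matched documents. As a rough sketch of that aggregation, the following counts facet values over plain dictionaries; `count_facets` is a hypothetical helper, and the sample documents are made up for illustration.

```python
from collections import Counter


def count_facets(results, facet_fields, facet_limit=10):
    """Count how often each value of each facet field occurs in the results."""
    facets = {}
    for field in facet_fields:
        counts = Counter(
            doc[field] for doc in results if doc.get(field) is not None
        )
        # Keep only the most frequent values for each field
        facets[field] = dict(counts.most_common(facet_limit))
    return facets


# Hypothetical stored fields of three matching documents
sample_results = [
    {"title": "A", "category": "ai", "author": "Alice"},
    {"title": "B", "category": "ai", "author": "Bob"},
    {"title": "C", "category": "programming", "author": "Alice"},
]
sample_facets = count_facets(sample_results, ["category", "author"])
print(sample_facets["category"])
print(sample_facets["author"])
```

Because the counts are computed over the returned results, a larger `limit` gives more representative facet counts.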
Phrase Queries
Find exact phrases in hybrid results:
with ix.searcher() as searcher:
    # Exact phrase
    results = searcher.phrase_search(
        field="content",
        phrase="machine learning",
        slop=0,
        limit=10
    )
    # With slop (allows words in between)
    results = searcher.phrase_search(
        field="content",
        phrase="machine learning",
        slop=2,  # "machine X Y learning" matches
        limit=10
    )
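The effect of slop can be shown on a plain token list. The sketch below assumes slop means the maximum number of extra tokens allowed between consecutive phrase terms, which matches the "machine X Y learning" comment above; `phrase_matches` is a hypothetical helper, not the library's matcher.

```python
def phrase_matches(tokens, phrase, slop=0):
    """Return True if the phrase terms occur in order with at most `slop`
    extra tokens between each pair of consecutive terms."""
    if not phrase:
        return False

    def extend(pos, idx):
        # All phrase terms have been placed
        if idx == len(phrase):
            return True
        # The next term may sit up to `slop` tokens past the adjacent position
        last = min(pos + slop + 2, len(tokens))
        for nxt in range(pos + 1, last):
            if tokens[nxt] == phrase[idx] and extend(nxt, idx + 1):
                return True
        return False

    return any(
        tokens[start] == phrase[0] and extend(start, 1)
        for start in range(len(tokens))
    )


phrase = ["machine", "learning"]
adjacent = "machine learning enables systems".split()
spread = "machine aided deep learning".split()

print(phrase_matches(adjacent, phrase, slop=0))  # terms are adjacent
print(phrase_matches(spread, phrase, slop=0))    # two words in between
print(phrase_matches(spread, phrase, slop=2))    # gap of two is allowed
```

The recursive search tries every candidate position for each term, so a repeated word early in the text does not hide a valid match later on.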
Sorted Hybrid Search
Sort hybrid results by custom criteria:
with ix.searcher() as searcher:
    # Sort by date (newest first), then by relevance score
    results = searcher.search_sorted(
        "python programming",
        sort_by=[
            ("published", True),  # Descending
            ("score", True)  # Descending
        ],
        limit=20,
        alpha=0.5
    )
    for r in results:
        doc = r.stored_fields
        print(f"{doc['title']} - {doc['published']}")
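The `sort_by` list above orders results by the first key and uses later keys to break ties. A minimal sketch of that behavior on plain dictionaries follows; `sort_hits` is a hypothetical helper, and it assumes each tuple is (field, descending) as the comments above indicate.

```python
def sort_hits(hits, sort_by):
    """Sort dicts by several (field, descending) keys, first key most significant."""
    ordered = list(hits)
    # Python's sort is stable, so apply the keys from least to most significant
    for field, descending in reversed(sort_by):
        ordered.sort(key=lambda hit: hit[field], reverse=descending)
    return ordered


# Hypothetical hits; ISO date strings sort chronologically
hits = [
    {"title": "A", "published": "2024-01-15", "score": 0.7},
    {"title": "B", "published": "2024-02-20", "score": 0.4},
    {"title": "C", "published": "2024-02-20", "score": 0.9},
]
sorted_titles = [
    hit["title"]
    for hit in sort_hits(hits, [("published", True), ("score", True)])
]
print(sorted_titles)  # newest first, ties broken by score
```

Documents B and C share a publication date, so the score decides their order, while A comes last because it is older.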
Configuration
Embedding Provider
Choose an embedding model based on your needs:
from semlix.semantic import SentenceTransformerProvider
# Fast and lightweight (384-dim)
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
# Better quality (768-dim)
embedder = SentenceTransformerProvider("all-mpnet-base-v2")
# Multilingual
embedder = SentenceTransformerProvider("paraphrase-multilingual-MiniLM-L12-v2")
Vector Store Configuration
Configure PostgreSQL vector storage:
from semlix.semantic.stores import PgVectorStore
vector_store = PgVectorStore(
    connection_string="postgresql://localhost/mydb",
    dimension=384,
    distance_metric="cosine",  # or "l2", "inner_product"
    pool_size=10
)
# Create HNSW index for fast search
vector_store.create_index(
    index_type="hnsw",
    m=16,  # HNSW parameter
    ef_construction=64  # HNSW parameter
)
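The three `distance_metric` options correspond to standard vector measures. The sketch below computes them in plain Python for two small vectors. The function names are illustrative, and how semlix maps each option name to a PostgreSQL operator is an assumption not confirmed here.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus cosine similarity; ignores vector length."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / norm


u = [1.0, 0.0]
v = [0.0, 1.0]
w = [2.0, 0.0]  # same direction as u, twice the length

print(cosine_distance(u, v), l2_distance(u, v), inner_product(u, v))
print(cosine_distance(u, w), l2_distance(u, w), inner_product(u, w))
```

Cosine distance treats `u` and `w` as identical because they point the same way, while L2 distance and inner product are sensitive to their lengths. For embeddings that are already normalized to unit length, the three metrics produce the same ranking.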
Searchable Fields
Control which fields are used for embeddings:
ix = create_unified_index(
    index_dir="my_index",
    schema=schema,
    connection_string=pg_url,
    embedder=embedder,
    searchable_fields=["title", "content"]  # Only these fields
)
By default, all TEXT fields are used for embedding generation.
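One plausible way to turn the selected fields into a single input for the embedding model is to join their values. This is only a sketch under that assumption; `build_embedding_text` is a hypothetical helper, and semlix may instead embed each field separately or use a different separator.

```python
def build_embedding_text(doc, searchable_fields, separator="\n"):
    """Join the non-empty values of the selected fields into one string."""
    parts = [str(doc[field]) for field in searchable_fields if doc.get(field)]
    return separator.join(parts)


# Hypothetical document; only title and content feed the embedding
sample_doc = {
    "id": "1",
    "title": "AI Basics",
    "content": "Introduction to artificial intelligence...",
    "author": "Alice",
}
embedding_text = build_embedding_text(sample_doc, ["title", "content"])
print(embedding_text)
```

Leaving out fields such as `author` keeps metadata from influencing semantic similarity, while those fields remain available for filtering and faceting.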
Fusion Methods
Choose how to combine lexical and semantic scores:
from semlix.semantic.fusion import FusionMethod
with ix.searcher() as searcher:
    results = searcher.hybrid_search(
        "query",
        fusion_method=FusionMethod.RRF,  # Reciprocal Rank Fusion
        alpha=0.5
    )
Available Methods:
RRF (Reciprocal Rank Fusion): Recommended, parameter-free
LINEAR: Weighted linear combination
DBSF (Distribution-Based Score Fusion): Normalizes score distributions
RELATIVE_SCORE: Relative scoring normalization
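For readers unfamiliar with Reciprocal Rank Fusion, the following is a minimal sketch of the standard formula, in which each ranked list contributes 1 / (k + rank) to a document's fused score. The constant k=60 is a commonly used default and an assumption here, as is the unweighted form; `reciprocal_rank_fusion` is a hypothetical function, not the library's implementation.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the lexical and semantic searches
lexical_ranking = ["doc1", "doc2", "doc3"]
semantic_ranking = ["doc2", "doc4", "doc1"]

fused = reciprocal_rank_fusion([lexical_ranking, semantic_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

RRF uses only rank positions, so it needs no score normalization between BM25 and vector similarity. Documents that appear in both lists, such as doc1 and doc2 here, rise above documents found by only one method.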
Migration
From FileStorage + NumpyVectorStore
Migrate existing indexes to UnifiedIndex:
from semlix.tools import migrate_to_unified
from semlix.semantic import SentenceTransformerProvider
embedder = SentenceTransformerProvider()
migrate_to_unified(
    source_dir="old_whoosh_index",
    target_dir="new_unified_index",
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    vector_store_path="old_vectors.pkl",  # Reuse existing vectors
    batch_size=100
)
Migration Process:
Opens source index and vector store
Creates new UnifiedIndex
Migrates documents with embeddings
Reuses existing vectors when available
Generates new vectors for missing documents
Optimizes both indexes
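The reuse step in the process above can be sketched as follows: vectors already present are kept, and only documents without one are sent to the embedding model. `collect_embeddings` and `fake_embed` are hypothetical names used for illustration; the migration tool's internals are not shown in this guide.

```python
def collect_embeddings(doc_ids, texts, existing_vectors, embed):
    """Reuse stored vectors where present and embed only the missing documents."""
    missing = [i for i, doc_id in enumerate(doc_ids) if doc_id not in existing_vectors]
    # Embed all missing texts in a single call
    new_vectors = embed([texts[i] for i in missing]) if missing else []
    vectors = dict(existing_vectors)
    for i, vector in zip(missing, new_vectors):
        vectors[doc_ids[i]] = vector
    return [vectors[doc_id] for doc_id in doc_ids], len(missing)


def fake_embed(batch):
    # Placeholder embedding: text length as a one-dimensional vector
    return [[float(len(text))] for text in batch]


migrated_vectors, generated = collect_embeddings(
    doc_ids=["1", "2", "3"],
    texts=["alpha", "beta", "gamma"],
    existing_vectors={"1": [0.5], "3": [0.7]},
    embed=fake_embed,
)
print(migrated_vectors, generated)
```

Only document "2" is embedded in this run. Reused vectors are valid only if they were produced by the same embedding model that the new index uses.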
From BM25Index
Add vector search to existing BM25 index:
from semlix.bm25 import open_bm25_index
from semlix.unified import UnifiedIndex
from semlix.semantic import SentenceTransformerProvider
from semlix.semantic.stores import PgVectorStore
# Open existing BM25 index
bm25_ix = open_bm25_index("my_bm25_index")
# Create vector store
embedder = SentenceTransformerProvider()
vector_store = PgVectorStore(
    "postgresql://localhost/mydb",
    dimension=embedder.dimension
)
# Generate embeddings for existing documents
docs = []
with bm25_ix.reader() as reader:
    for doc in reader.iter_docs():
        docs.append(doc)
# Extract text and generate embeddings
texts = [doc.get("content", "") for doc in docs]
doc_ids = [doc.get("id", str(i)) for i, doc in enumerate(docs)]
embeddings = embedder.encode(texts)
# Add to vector store
vector_store.add(doc_ids, embeddings)
# Create unified index
unified_ix = UnifiedIndex(
    index_dir="unified_index",
    schema=bm25_ix.schema,
    connection_string="postgresql://localhost/mydb",
    embedder=embedder,
    bm25_index=bm25_ix,
    vector_store=vector_store
)
Performance
Search Performance
Hybrid Search:
500+ queries/second (10K documents)
~5-10ms latency (p50)
Scales well with document count
Lexical-Only:
1000+ queries/second
~1-2ms latency
Semantic-Only:
~100 queries/second (with HNSW index)
~10-20ms latency
Indexing Performance
With Embedding Generation:
~100 documents/second
Depends on embedding model speed
Can batch for better throughput
Optimization:
Use batch processing for bulk indexing:
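batch_size = 100
batch = []
with ix.writer() as writer:
    for doc in documents:
        batch.append(doc)
        if len(batch) >= batch_size:
            for doc_fields in batch:
                writer.add_document(**doc_fields)
            batch = []
    # Write the final partial batch, which would otherwise be dropped
    for doc_fields in batch:
        writer.add_document(**doc_fields)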
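The batching logic can also be factored into a small generator, which guarantees that a trailing partial batch is yielded as well. This is a generic sketch using only the standard library; `batched` is a helper defined here, not a semlix function, and the sample documents are made up.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield successive lists of at most batch_size items, including the last partial one."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical documents to index
documents = [{"id": str(i), "content": f"Document {i}"} for i in range(250)]
batch_sizes = [len(batch) for batch in batched(documents, 100)]
print(batch_sizes)  # → [100, 100, 50]
```

Each yielded batch can then be passed to its own `with ix.writer() as writer:` block, so that one commit covers one batch.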
Memory Usage
| Component | 10K docs | 100K docs |
|---|---|---|
| BM25 Index | 100MB | 500MB |
| Vector Store (PG) | 40MB | 400MB |
| Total (approx) | 140MB | 900MB |
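As a rough cross-check of the vector store figures, the raw size of the stored vectors can be estimated from the document count and the embedding dimension. The sketch assumes one 384-dimensional vector per document and pgvector's documented storage cost of 4 bytes per dimension plus 8 bytes per vector. It gives a lower bound only: row overhead, indexes and other metadata are not modeled, and attributing the remainder of the table's figures to them is an assumption.

```python
def raw_vector_bytes(num_docs, dimension):
    """Estimate raw vector storage: 4 bytes per dimension plus 8 bytes per vector."""
    return num_docs * (4 * dimension + 8)


for num_docs in (10_000, 100_000):
    size_mb = raw_vector_bytes(num_docs, 384) / 1_000_000
    print(f"{num_docs} docs: {size_mb:.1f} MB of raw vector data")
```

Switching to a 768-dimensional model roughly doubles this raw figure, which is worth weighing when choosing an embedding model.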
Disk Usage
| Component | 10K docs | 100K docs |
|---|---|---|
| BM25 Index | 50MB | 250MB |
| PostgreSQL (total) | 100MB | 800MB |
| Total (approx) | 150MB | 1050MB |
Examples
Basic Hybrid Search
from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID
from semlix.semantic import SentenceTransformerProvider
schema = Schema(id=ID(stored=True), content=TEXT(stored=True))
embedder = SentenceTransformerProvider()
ix = create_unified_index(
    "my_index",
    schema,
    "postgresql://localhost/mydb",
    embedder
)
# Index
with ix.writer() as writer:
    writer.add_document(
        id="1",
        content="Python is a programming language"
    )
    writer.add_document(
        id="2",
        content="Machine learning uses neural networks"
    )
# Search
with ix.searcher() as searcher:
    # Hybrid: finds both keyword and semantic matches
    results = searcher.hybrid_search("coding in python", limit=10)
Complete Example with All Features
from semlix.unified import create_unified_index
from semlix.fields import Schema, TEXT, ID, KEYWORD, DATETIME
from semlix.semantic import SentenceTransformerProvider
schema = Schema(
    id=ID(stored=True),
    title=TEXT(stored=True),
    content=TEXT(stored=True),
    category=KEYWORD(stored=True),
    published=DATETIME(stored=True)
)
embedder = SentenceTransformerProvider("all-MiniLM-L6-v2")
ix = create_unified_index(
    "my_index",
    schema,
    "postgresql://localhost/mydb",
    embedder
)
# Index documents
with ix.writer() as writer:
    writer.add_document(
        id="1",
        title="AI Basics",
        content="Introduction to artificial intelligence...",
        category="ai",
        published="2024-01-15"
    )
    # ... more documents ...
# Search with all features
with ix.searcher() as searcher:
    # Hybrid search with facets
    results, facets = searcher.search_with_facets(
        "artificial intelligence",
        facet_fields=["category"],
        limit=50,
        alpha=0.5
    )
    # Sort by date
    sorted_results = searcher.sort_results(
        results,
        [("published", True)]
    )
    # Phrase search
    phrase_results = searcher.phrase_search(
        "content",
        "machine learning"
    )
Best Practices
Choose appropriate alpha:
Use alpha=0.3-0.5 for balanced search
Use alpha=0.0 for exact keyword matching
Use alpha=0.8-1.0 for conceptual/semantic queries
Batch indexing for performance:
Index in batches of 100-1000 documents
Commit once per batch, not per document
Create HNSW index for vectors:
Essential for good semantic search performance
Create after bulk indexing:
ix.optimize() # Optimizes both BM25 and vector indexes
Choose embedding model wisely:
Start with all-MiniLM-L6-v2 (fast, good quality)
Upgrade to all-mpnet-base-v2 if quality matters more than speed
Use multilingual models only if needed
Monitor PostgreSQL:
Regular VACUUM ANALYZE
Monitor connection pool usage
Consider replication for high availability
Troubleshooting
Slow Semantic Search
Problem: Vector search is slow (>100ms per query)
Solutions:
Create HNSW index:
ix.vector_store.create_index(index_type="hnsw")
Tune HNSW parameters:
ix.vector_store.create_index(
    index_type="hnsw",
    m=32,  # Higher = better quality, slower build
    ef_construction=128  # Higher = better quality, slower build
)
Memory Issues
Problem: High memory usage during indexing
Solutions:
Use smaller batches
Enable memory mapping for BM25:
from semlix.stores import BM25sStore
store = BM25sStore.load(index_dir, mmap=True)
Reduce connection pool size
Connection Pool Exhausted
Problem: PostgreSQL connection errors
Solutions:
Increase pool size:
vector_store = PgVectorStore(
    connection_string=pg_url,
    pool_size=50  # Increase from default 10
)
Close searchers when done
Use context managers (with statements)
See Also
BM25 Index - BM25 index documentation
Semantic Search - Semantic search and HybridSearcher
How to index documents - General indexing concepts
How to search - Search and query syntax