================== Semantic Search ================== semlix includes optional semantic search capabilities that enable hybrid search, combining traditional lexical matching (BM25/TF-IDF) with modern vector-based semantic similarity. This allows queries to match documents based on meaning, not just keywords. Overview ======== Semantic search uses dense vector embeddings to understand the semantic meaning of text. When combined with semlix's traditional lexical search (inherited from semlix), you get hybrid search that leverages the strengths of both approaches: * **Lexical search** (BM25/TF-IDF): Excellent for exact keyword matching, phrase queries, and structured searches * **Semantic search** (vector embeddings): Understands meaning and context, finds relevant documents even without keyword overlap Hybrid search combines both approaches using result fusion algorithms to produce superior results compared to either method alone. Quick Start =========== Here's a minimal example of using semantic search:: from semlix.index import create_in from semlix.fields import Schema, TEXT, ID from semlix.semantic import ( HybridSearcher, HybridIndexWriter, SentenceTransformerProvider ) from semlix.semantic.stores import NumpyVectorStore from pathlib import Path # Create schema and index schema = Schema( id=ID(stored=True, unique=True), title=TEXT(stored=True), content=TEXT(stored=True) ) ix = create_in("my_index", schema) # Create semantic components embedder = SentenceTransformerProvider("all-MiniLM-L6-v2") vector_store = NumpyVectorStore(dimension=embedder.dimension) # Index documents with HybridIndexWriter( ix, vector_store, embedder, embedding_field="content", id_field="id" ) as writer: writer.add_document( id="1", title="Python Tutorial", content="Learn Python programming basics and syntax" ) writer.add_document( id="2", title="Authentication Guide", content="How to fix login and authentication issues" ) # Save vector store vector_store.save("vectors.pkl") # Search searcher = HybridSearcher( index=ix, vector_store=vector_store, embedding_provider=embedder, alpha=0.5 # 50% lexical, 50% semantic ) # This query will match "Authentication Guide" even without keyword overlap results = searcher.search("password problems", limit=10) for r in results: print(f"{r['title']}: {r.score:.4f}") Components ========== The semantic search module consists of several key components: Embedding Providers ------------------- Embedding providers generate dense vector representations of text. semlix supports multiple providers: **SentenceTransformerProvider** (Recommended for local use):: from semlix.semantic import SentenceTransformerProvider embedder = SentenceTransformerProvider("all-MiniLM-L6-v2") # Popular models: # - "all-MiniLM-L6-v2": Fast, good quality (384 dim) # - "all-mpnet-base-v2": Higher quality (768 dim) # - "multi-qa-MiniLM-L6-dot-v1": Optimized for QA (384 dim) Requires: ``pip install sentence-transformers`` **OpenAIProvider** (For cloud-based embeddings):: from semlix.semantic import OpenAIProvider embedder = OpenAIProvider( model="text-embedding-3-small", api_key="your-api-key" # or set OPENAI_API_KEY env var ) Requires: ``pip install openai`` **CohereProvider**:: from semlix.semantic import CohereProvider embedder = CohereProvider( model="embed-english-v3.0", api_key="your-api-key" # or set CO_API_KEY env var ) Requires: ``pip install cohere`` **HuggingFaceInferenceProvider**:: from semlix.semantic import HuggingFaceInferenceProvider embedder = HuggingFaceInferenceProvider( model="sentence-transformers/all-MiniLM-L6-v2", api_key="your-token" # or set HF_TOKEN env var ) Requires: ``pip install huggingface_hub`` Vector Stores ------------- Vector stores persist and search embeddings. Choose based on your dataset size: **NumpyVectorStore** (Pure-Python, < 100k vectors):: from semlix.semantic.stores import NumpyVectorStore store = NumpyVectorStore(dimension=384) store.add(doc_ids, embeddings, metadata) results = store.search(query_embedding, k=10) store.save("vectors.pkl") loaded = NumpyVectorStore.load("vectors.pkl") No additional dependencies required (uses NumPy). **FaissVectorStore** (High-performance, millions of vectors):: from semlix.semantic.stores import FaissVectorStore # Flat index for exact search store = FaissVectorStore(dimension=384, index_type="Flat") # IVF index for approximate search (requires training) store = FaissVectorStore(dimension=384, index_type="IVF", nlist=100) store.train(training_embeddings) # Train on representative sample store.add(doc_ids, embeddings) # HNSW index for best speed store = FaissVectorStore(dimension=384, index_type="HNSW") Requires: ``pip install faiss-cpu`` (or ``faiss-gpu`` for GPU support) Hybrid Index Writer -------------------- The :class:`~semlix.semantic.HybridIndexWriter` maintains both a semlix index and a vector store in sync:: from semlix.semantic import HybridIndexWriter writer = HybridIndexWriter( index=ix, vector_store=vector_store, embedding_provider=embedder, embedding_field="content", # Field to generate embeddings from id_field="id", # Field containing document ID batch_size=100 # Batch size for embedding generation ) # Use as context manager with writer: writer.add_document(id="1", title="Doc 1", content="Text to embed") writer.add_document(id="2", title="Doc 2", content="More text") # Or manually writer.add_document(id="3", content="Another document") writer.commit() The writer automatically: * Adds documents to the semlix index * Generates embeddings in batches * Adds embeddings to the vector store Hybrid Searcher --------------- The :class:`~semlix.semantic.HybridSearcher` combines lexical and semantic search:: from semlix.semantic import HybridSearcher, FusionMethod searcher = HybridSearcher( index=ix, vector_store=vector_store, embedding_provider=embedder, default_field="content", id_field="id", alpha=0.5, # Weight for semantic (0=lexical, 1=semantic) fusion_method=FusionMethod.RRF, # Result fusion algorithm rrf_k=60 # RRF constant ) # Hybrid search (combines both) results = searcher.search("query text", limit=10) # Lexical-only search (traditional semlix/Whoosh) results = searcher.search_lexical_only("exact keywords", limit=10) # Semantic-only search results = searcher.search_semantic_only("conceptual query", limit=10) # Adjust balance per query results = searcher.search("query", alpha=0.8) # Prefer semantic Result Fusion ------------- Result fusion combines rankings from lexical and semantic search. Available methods: **RRF (Reciprocal Rank Fusion)** - Recommended:: searcher = HybridSearcher(..., fusion_method=FusionMethod.RRF, rrf_k=60) Robust to score scale differences, rank-based (doesn't depend on raw scores). **Linear Fusion**:: searcher = HybridSearcher(..., fusion_method=FusionMethod.LINEAR) Simple weighted combination: ``combined = (1-alpha) * lexical + alpha * semantic`` **DBSF (Distribution-Based Score Fusion)**:: searcher = HybridSearcher(..., fusion_method=FusionMethod.DBSF) Z-score normalization before combining, handles different score distributions. **Relative Score Fusion**:: searcher = HybridSearcher(..., fusion_method=FusionMethod.RELATIVE_SCORE) Percentile-based normalization, robust to outliers. Advanced Usage ============== Building Vector Store from Existing Index ------------------------------------------ If you have an existing semlix/semlix index, you can build a vector store from it:: from semlix.index import open_dir from semlix.semantic import build_vector_store_from_index from semlix.semantic import SentenceTransformerProvider from semlix.semantic.stores import NumpyVectorStore # Open existing index ix = open_dir("my_existing_index") # Create semantic components embedder = SentenceTransformerProvider() vector_store = NumpyVectorStore(dimension=embedder.dimension) # Build vector store from index count = build_vector_store_from_index( index=ix, vector_store=vector_store, embedding_provider=embedder, embedding_field="content", id_field="id", show_progress=True ) print(f"Indexed {count} documents") vector_store.save("vectors.pkl") Using FAISS for Large Datasets ------------------------------- For datasets with millions of documents, use FAISS with an approximate index:: from semlix.semantic.stores import FaissVectorStore import numpy as np # Create IVF index vector_store = FaissVectorStore( dimension=384, index_type="IVF", nlist=1000, # Number of clusters nprobe=50 # Clusters to search ) # Train on representative sample (10% of data) sample_size = len(all_texts) // 10 sample_texts = all_texts[:sample_size] sample_embeddings = embedder.encode(sample_texts) vector_store.train(sample_embeddings) # Add all documents all_embeddings = embedder.encode(all_texts, show_progress=True) vector_store.add(all_doc_ids, all_embeddings) # Save vector_store.save("large_vectors.faiss") Adjusting Search Balance ------------------------ The ``alpha`` parameter controls the balance between lexical and semantic search: * ``alpha=0.0``: Pure lexical search (traditional semlix/Whoosh) * ``alpha=0.5``: Balanced hybrid search (default) * ``alpha=1.0``: Pure semantic search You can adjust per query:: # Prefer lexical for exact keyword queries results = searcher.search("error code 404", alpha=0.2) # Prefer semantic for conceptual queries results = searcher.search("ways to improve performance", alpha=0.8) Performance Considerations ========================== Dataset Size Recommendations ---------------------------- **Small datasets (< 10K docs)**: Use ``NumpyVectorStore`` - Pure Python, no dependencies **Medium datasets (10K - 100K)**: Use ``FaissVectorStore`` with ``index_type="Flat"`` - Exact search, fast enough **Large datasets (100K - 1M)**: Use ``FaissVectorStore`` with ``index_type="IVF"`` - Approximate, tunable accuracy **Very large datasets (> 1M)**: Use ``FaissVectorStore`` with ``index_type="HNSW"`` - Best speed/accuracy tradeoff Embedding Caching ----------------- For production, consider caching embeddings to avoid recomputing:: from functools import lru_cache import numpy as np class CachedEmbedder: def __init__(self, provider, cache_size=10000): self._provider = provider self._cache = {} self._cache_size = cache_size @property def dimension(self): return self._provider.dimension def encode(self, texts, **kwargs): # Check cache uncached = [] uncached_indices = [] results = [None] * len(texts) for i, text in enumerate(texts): if text in self._cache: results[i] = self._cache[text] else: uncached.append(text) uncached_indices.append(i) # Encode uncached texts if uncached: embeddings = self._provider.encode(uncached, **kwargs) for text, emb, idx in zip(uncached, embeddings, uncached_indices): if len(self._cache) < self._cache_size: self._cache[text] = emb results[idx] = emb return np.array(results) Migration Guide =============== Existing Whoosh/semlix users can add semantic search without modifying existing code. Your current indexes and queries continue to work as before. To add semantic capabilities: 1. Install semantic dependencies:: pip install semlix[semantic] 2. Create a vector store from your existing index (see "Building Vector Store from Existing Index" above) 3. Use HybridSearcher for new queries, or continue using traditional searchers for existing code Example:: # Existing code (still works) from semlix.index import open_dir from semlix.qparser import QueryParser ix = open_dir("my_index") with ix.searcher() as s: q = QueryParser("content", ix.schema).parse("search query") results = s.search(q) # Traditional search # New semantic search (optional) from semlix.semantic import HybridSearcher, SentenceTransformerProvider from semlix.semantic.stores import NumpyVectorStore embedder = SentenceTransformerProvider() vector_store = NumpyVectorStore.load("vectors.pkl") searcher = HybridSearcher(ix, vector_store, embedder) results = searcher.search("search query") # Hybrid search API Reference ============= The main classes and functions are documented below. For complete API details, see the source code or use Python's ``help()`` function. Main Classes: * :class:`~semlix.semantic.HybridSearcher` - Main search interface * :class:`~semlix.semantic.HybridIndexWriter` - Index writer for hybrid search * :class:`~semlix.semantic.SentenceTransformerProvider` - Local embedding provider * :class:`~semlix.semantic.stores.NumpyVectorStore` - Pure-Python vector store * :class:`~semlix.semantic.stores.FaissVectorStore` - High-performance vector store .. note:: The semantic search module is optional and requires additional dependencies. See the installation instructions above.