hybrid_search
Hybrid retrieval over the KnowledgeStore: metadata filter + BM25 + vector cosine.
A single search entrypoint that the agentic research loop calls as a tool.
Structured WHERE-clause filters (person, time range, sources) narrow the
candidate set, then BM25 (FTS5) and dense cosine similarity score the
survivors. The two rankings are fused with Reciprocal Rank Fusion (RRF), which is
robust to the very different score scales the two signals produce
(BM25 ~ [0, 20], cosine ~ [0.4, 0.9] for nomic-embed-text).
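A minimal sketch of weighted RRF, to make the fusion step concrete. The function name `rrf_fuse` and the exact way the weights enter the sum are assumptions for illustration; the parameter names mirror the `HybridSearch` constructor, but the module's actual formula is not shown on this page. Because only ranks are used, the raw BM25 and cosine scales never need to be normalised.

```python
from typing import Dict, List


def rrf_fuse(
    bm25_ranked: List[str],
    vector_ranked: List[str],
    bm25_weight: float = 0.5,
    vector_weight: float = 0.5,
    rrf_k: int = 60,
) -> Dict[str, float]:
    """Fuse two ranked lists of chunk ids (best first) into one score per id.

    Hypothetical helper: each ranker contributes weight / (rrf_k + rank),
    with ranks starting at 1. Ids missing from a list contribute nothing.
    """
    fused: Dict[str, float] = {}
    for weight, ranking in ((bm25_weight, bm25_ranked), (vector_weight, vector_ranked)):
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + weight / (rrf_k + rank)
    return fused


# Toy rankings: "c1" is near the top of both lists, "c4" appears in only one.
bm25_ranked = ["c1", "c2", "c3"]
vector_ranked = ["c3", "c1", "c4"]
fused = rrf_fuse(bm25_ranked, vector_ranked)
ordered = sorted(fused, key=fused.get, reverse=True)
```

In this toy case a chunk that both rankers place near the top outranks one that only a single ranker found, which is the behaviour the fusion is meant to provide.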
Each result is enriched with its thread context: when a hit belongs to a
thread_id, the surrounding chunks are attached so the synthesis model
sees the conversation, not an isolated fragment.
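The enrichment step might look roughly like the sketch below. Everything here is an assumption for illustration: the helper name `attach_thread_context`, the dict-based chunk records, and in particular the windowing policy (a window of at most `cap` chunks around the hit, ordered by timestamp). The page only states that surrounding chunks are attached and that a `thread_context_cap` exists.

```python
from typing import Any, Dict, List


def attach_thread_context(
    hit: Dict[str, Any],
    all_chunks: List[Dict[str, Any]],
    cap: int = 20,
) -> Dict[str, Any]:
    """Return a copy of `hit` with up to `cap` chunks of its thread attached.

    Hypothetical helper; assumes `hit` is itself present in `all_chunks`.
    The attached window includes the hit so the model sees it in place.
    """
    enriched = dict(hit, thread_context=[])
    if not hit.get("thread_id"):
        return enriched  # no thread, nothing to attach
    thread = sorted(
        (c for c in all_chunks if c.get("thread_id") == hit["thread_id"]),
        key=lambda c: c["timestamp"],
    )
    # Find the hit inside its thread and keep a window of at most `cap` chunks around it.
    pos = next(i for i, c in enumerate(thread) if c["chunk_id"] == hit["chunk_id"])
    start = max(0, min(pos - cap // 2, len(thread) - cap))
    enriched["thread_context"] = thread[start:start + cap]
    return enriched


chunks = [
    {"chunk_id": f"a{i}", "thread_id": "t1", "timestamp": f"2024-05-0{i}T10:00:00"}
    for i in range(1, 6)
] + [{"chunk_id": "b1", "thread_id": "", "timestamp": "2024-05-03T11:00:00"}]

in_thread = attach_thread_context(chunks[3], chunks, cap=3)  # hit is "a4"
standalone = attach_thread_context(chunks[5], chunks, cap=3)  # hit has no thread
context_ids = [c["chunk_id"] for c in in_thread["thread_context"]]
```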
Brute-force vector scan is fine at the current corpus size (~5k chunks × 768 dims fits in ~15 MB and matmuls in <50 ms). Swap in an ANN index when that stops being true.
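As a sanity check on the sizing claim, and to show what a brute-force scan amounts to, here is a small pure-Python sketch. It assumes float32 storage (4 bytes per value), which the page does not state, and uses a plain loop for readability; a real implementation would presumably use a vectorised matrix multiply as the text suggests. The helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Memory estimate for the corpus size quoted above, assuming float32 embeddings.
n_chunks, dims, bytes_per_float = 5_000, 768, 4
matrix_mb = n_chunks * dims * bytes_per_float / 1e6  # megabytes for the full matrix


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_top_k(
    query: Sequence[float], corpus: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Score every corpus vector against the query and keep the k best (index, score)."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
top = brute_force_top_k([1.0, 0.1], corpus, k=2)
top_indices = [i for i, _ in top]
```

The scan is linear in the number of chunks, so its cost grows with the corpus; that growth is what eventually motivates an ANN index.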
Classes
SearchHit (dataclass)
SearchHit(chunk_id: str, document_id: str, chunk_idx: int, title: str, content_snippet: str, source: str, timestamp: str, participants: List[str], score: float, bm25_score: float, vector_score: float, thread_id: str = '', thread_context: List[Dict[str, Any]] = list(), url: str = '')
A single hybrid-search result with enough context for citation.
HybridSearch
HybridSearch(store: KnowledgeStore, embedder: Optional[OllamaEmbedder] = None, *, bm25_weight: float = 0.5, vector_weight: float = 0.5, rrf_k: int = 60, recall_k: int = 200, thread_context_cap: int = 20)
Hybrid BM25 + dense-cosine retrieval over a KnowledgeStore.
| PARAMETER | DESCRIPTION |
|---|---|
| `store` | The store to query. TYPE: `KnowledgeStore` |
| `embedder` | Embedding client used to encode the query. When … TYPE: `Optional[OllamaEmbedder]` |
| `bm25_weight` | Weights on the two RRF terms. Defaults to 0.5 / 0.5; raise either to bias retrieval toward lexical or semantic matches. TYPE: `float` |
| `vector_weight` | Weights on the two RRF terms. Defaults to 0.5 / 0.5; raise either to bias retrieval toward lexical or semantic matches. TYPE: `float` |
| `rrf_k` | RRF damping constant. Larger values flatten the contribution of deeper ranks; 60 is the canonical value from the original paper. TYPE: `int` |
| `recall_k` | How deep each individual ranker recalls before fusion. Should be at least a few times … TYPE: `int` |
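The damping effect of `rrf_k` described above can be checked with a few lines of arithmetic. This sketch assumes the standard RRF term `1 / (rrf_k + rank)`; `rank_ratio` is a hypothetical helper, not part of the module.

```python
def rank_ratio(rrf_k: int, shallow: int = 1, deep: int = 10) -> float:
    """How many times more a hit at rank `shallow` contributes than one at rank `deep`."""
    return (1.0 / (rrf_k + shallow)) / (1.0 / (rrf_k + deep))


# Compare a very small constant, the default of 60, and a very large one.
ratios = {k: round(rank_ratio(k), 3) for k in (1, 10, 60, 600)}
```

With a small constant the top rank dominates (rank 1 counts several times as much as rank 10), while at 60 the two contributions are within roughly 15% of each other, so agreement between the rankers matters more than any single top position.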
Source code in src/openjarvis/connectors/hybrid_search.py
Functions
search
search(query: str, *, person: Optional[str] = None, time_range: Optional[Tuple[Optional[datetime], Optional[datetime]]] = None, sources: Optional[Sequence[str]] = None, limit: int = 20) -> List[SearchHit]
Run the hybrid pipeline and return up to limit hits.
See the module docstring for ranking semantics. query may be empty
when callers want a pure metadata filter (e.g. "all mail from X in
May"). In that case only the vector leg runs (and only if an
embedder is configured); if neither leg yields anything, the
structured filter is applied directly and the most recent rows are
returned.
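The metadata-only fallback could be sketched as a parameterised SQL query like the one below. The table name `chunks`, its columns, and the helper `filter_recent` are hypothetical; the real KnowledgeStore schema is not shown on this page. The keyword arguments mirror the `search` signature.

```python
import sqlite3
from datetime import datetime
from typing import List, Optional, Sequence, Tuple


def filter_recent(
    conn: sqlite3.Connection,
    *,
    person: Optional[str] = None,
    time_range: Optional[Tuple[Optional[datetime], Optional[datetime]]] = None,
    sources: Optional[Sequence[str]] = None,
    limit: int = 20,
) -> List[str]:
    """Apply only the structured filters and return the newest matching chunk ids."""
    clauses, params = [], []
    if person:
        clauses.append("author = ?")
        params.append(person)
    if time_range:
        start, end = time_range  # either bound may be None (open-ended range)
        if start is not None:
            clauses.append("timestamp >= ?")
            params.append(start.isoformat())
        if end is not None:
            clauses.append("timestamp <= ?")
            params.append(end.isoformat())
    if sources:
        clauses.append("source IN (%s)" % ", ".join("?" for _ in sources))
        params.extend(sources)
    where = "WHERE " + " AND ".join(clauses) if clauses else ""
    sql = f"SELECT chunk_id FROM chunks {where} ORDER BY timestamp DESC LIMIT ?"
    return [row[0] for row in conn.execute(sql, (*params, limit))]


# Toy in-memory store with ISO-8601 timestamps, which sort correctly as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (chunk_id TEXT, author TEXT, source TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("m1", "alice", "gmail", "2024-05-02T09:00:00"),
        ("m2", "alice", "gmail", "2024-05-20T17:30:00"),
        ("m3", "bob", "gmail", "2024-05-10T12:00:00"),
        ("m4", "alice", "slack", "2024-06-01T08:00:00"),
    ],
)
may = (datetime(2024, 5, 1), datetime(2024, 5, 31, 23, 59, 59))
recent = filter_recent(conn, person="alice", time_range=may, sources=["gmail"])
```

Only the placeholders are built dynamically; all values travel as bound parameters, so caller-supplied filter values are never spliced into the SQL text.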
Source code in src/openjarvis/connectors/hybrid_search.py