chunker
Type-aware semantic chunker for Deep Research ingestion.
Splits text by paragraph → sentence → token/character boundaries while enforcing a hard size cap and adding a fixed-size overlap between consecutive chunks. The cap is enforced on BOTH token count (whitespace-split) AND character count, since marketing-style emails frequently contain dense runs without whitespace (zero-width joiners, HTML residue) that defeat token-based limits alone.
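To make the dual cap concrete, here is a minimal sketch of a size check that applies both limits. The helper name `within_caps` and the cap values are illustrative assumptions, not part of the module's documented API; only the whitespace-split token count comes from the description above.

```python
def within_caps(text: str, max_tokens: int, max_chars: int) -> bool:
    """Return True only if text satisfies both the token cap and the character cap."""
    token_count = len(text.split())  # whitespace-split token count
    return token_count <= max_tokens and len(text) <= max_chars


# A dense run with no whitespace is a single token, however long it is,
# so a token-only limit would accept it. The character cap rejects it.
dense_run = "x" * 5000
print(len(dense_run.split()))                                  # 1
print(within_caps(dense_run, max_tokens=512, max_chars=2048))  # False
```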
Splitting strategy by doc_type
- event, contact: Always a single chunk; never split, never capped.
- email: Split on reply boundaries (On … wrote:), then paragraphs, then sentences, then force-split.
- message: Split on double-newline boundaries, then sentences, then force-split.
- document, note, anything else: Split on ## Heading section boundaries → paragraph boundaries → sentences → force-split.
Token counting uses whitespace splitting: len(text.split()).
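The shared tail of these strategies (paragraphs, then sentences, then force-split) can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the function names, the sentence regex, and the force-split rule are all hypothetical, and the sketch neither handles the type-specific first stage (reply or heading boundaries) nor packs small pieces together into larger chunks.

```python
import re
from typing import List


def fits(text: str, max_tokens: int, max_chars: int) -> bool:
    """Check both caps, counting tokens by whitespace splitting."""
    return len(text.split()) <= max_tokens and len(text) <= max_chars


def force_split(text: str, max_tokens: int, max_chars: int) -> List[str]:
    """Last resort: cut on token boundaries, then on raw characters."""
    pieces: List[str] = []
    words = text.split()
    for i in range(0, len(words), max_tokens):
        group = " ".join(words[i:i + max_tokens])
        # A group can still be too long in characters (e.g. one huge token).
        for j in range(0, len(group), max_chars):
            piece = group[j:j + max_chars].strip()
            if piece:
                pieces.append(piece)
    return pieces


def split_cascade(text: str, max_tokens: int, max_chars: int) -> List[str]:
    """Try paragraphs, then sentences, then force-split; each piece fits both caps."""
    out: List[str] = []
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if fits(para, max_tokens, max_chars):
            out.append(para)
            continue
        # Assumed sentence rule: break after ., ! or ? followed by whitespace.
        for sent in re.split(r"(?<=[.!?])\s+", para):
            sent = sent.strip()
            if not sent:
                continue
            if fits(sent, max_tokens, max_chars):
                out.append(sent)
            else:
                out.extend(force_split(sent, max_tokens, max_chars))
    return out


text = "Short paragraph.\n\nFirst sentence is here. Second sentence is here too."
print(split_cascade(text, max_tokens=5, max_chars=200))
```

Each stage is tried only when the coarser unit exceeds a cap, so well-formed text keeps its natural boundaries and only oversized runs are cut arbitrarily.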
Classes
ChunkResult
dataclass
A single chunk produced by SemanticChunker.chunk().
SemanticChunker
SemanticChunker(max_tokens: int = 512, *, max_chars: Optional[int] = None, overlap_tokens: Optional[int] = None)
Split text by document type with a hard size cap and overlap.
| PARAMETER | DESCRIPTION |
|---|---|
| max_tokens | Hard upper bound on chunk size in whitespace-delimited tokens. No emitted chunk exceeds this, apart from event and contact documents, which are never capped. (TYPE: int, DEFAULT: 512) |
| max_chars | Hard upper bound on chunk size in characters. Defaults to … (TYPE: Optional[int], DEFAULT: None) |
| overlap_tokens | Number of trailing tokens copied from each chunk into the head of the next, so that downstream retrieval doesn't miss context that straddles a chunk boundary. Defaults to … (TYPE: Optional[int], DEFAULT: None) |
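As a rough illustration of what overlap_tokens describes, the sketch below copies the token tail of each chunk onto the head of the next. The helper `add_overlap` is hypothetical and is not taken from the module's source; in particular, whether the real implementation counts the overlap toward the size caps is not stated here, and this sketch does not enforce any cap.

```python
from typing import List


def add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    """Prefix every chunk after the first with the last tokens of its predecessor."""
    if overlap_tokens <= 0:
        return list(chunks)
    out: List[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)  # the first chunk has no predecessor
            continue
        # Take the tail from the original previous chunk, not the overlapped one,
        # so overlap does not accumulate from chunk to chunk.
        tail = chunks[i - 1].split()[-overlap_tokens:]
        out.append(" ".join(tail + [chunk]))
    return out


print(add_overlap(["a b c d", "e f g", "h i"], overlap_tokens=2))
# ['a b c d', 'c d e f g', 'f g h i']
```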
Source code in src/openjarvis/connectors/chunker.py
Functions
chunk
chunk(text: str, *, doc_type: str = 'document', metadata: Optional[Dict[str, Any]] = None) -> List[ChunkResult]
Split text into ChunkResult objects.
Returns an empty list if text is empty or whitespace-only. Events and contacts are always returned as a single chunk regardless of size; all other types respect the size caps and receive overlap between consecutive chunks.
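The edge cases described above can be summarized in a small stand-in. This is only a sketch of the documented contract: `chunk_contract` and its `split` argument are hypothetical, and it returns plain strings rather than ChunkResult objects, whose fields are not listed on this page.

```python
from typing import Callable, List


def chunk_contract(text: str, doc_type: str, split: Callable[[str], List[str]]) -> List[str]:
    """Mirror the documented edge cases; `split` stands in for the capped, overlapping splitter."""
    if not text.strip():
        return []      # empty or whitespace-only input
    if doc_type in ("event", "contact"):
        return [text]  # always a single chunk, regardless of size
    return split(text)


print(chunk_contract("   ", "document", str.split))           # []
print(chunk_contract("Team sync at 10", "event", str.split))  # ['Team sync at 10']
```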