chunker

Type-aware semantic chunker for Deep Research ingestion.

Splits text by paragraph → sentence → token/character boundaries while enforcing a hard size cap and adding a fixed-size overlap between consecutive chunks. The cap is enforced on BOTH token count (whitespace-split) AND character count, since marketing-style emails frequently contain dense runs without whitespace (zero-width joiners, HTML residue) that defeat token-based limits alone.

Splitting strategy by doc_type

event, contact : Always a single chunk; never split, never capped.
email : Split on reply boundaries (On … wrote:), then paragraphs, then sentences, then force-split.
message : Split on double-newline boundaries, then sentences, then force-split.
document, note, anything else : Split on ## Heading section boundaries → paragraph boundaries → sentences → force-split.
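The paragraph → sentence → force-split cascade can be sketched as below. This is a minimal illustration under stated assumptions, not the module's implementation: `fits`, `force_split`, and `split_cascade` are hypothetical names, the sentence regex is an assumption, and heading or reply-boundary detection, sentence accumulation, and overlap are left out.

```python
import re
from typing import List


def fits(text: str, max_tokens: int, max_chars: int) -> bool:
    """True when text is within both the token cap and the character cap."""
    return len(text.split()) <= max_tokens and len(text) <= max_chars


def force_split(text: str, max_tokens: int, max_chars: int) -> List[str]:
    """Last resort: pack whole words greedily, slicing oversized words by characters."""
    pieces: List[str] = []
    current: List[str] = []
    for word in text.split():
        # A single word longer than max_chars cannot fit anywhere, so slice it.
        while len(word) > max_chars:
            if current:
                pieces.append(" ".join(current))
                current = []
            pieces.append(word[:max_chars])
            word = word[max_chars:]
        candidate = " ".join(current + [word])
        if current and not fits(candidate, max_tokens, max_chars):
            pieces.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        pieces.append(" ".join(current))
    return pieces


def split_cascade(text: str, max_tokens: int, max_chars: int) -> List[str]:
    """Try paragraphs first, then sentences, then force-split (hypothetical helper)."""
    chunks: List[str] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if not para:
            continue
        if fits(para, max_tokens, max_chars):
            chunks.append(para)
            continue
        # Assumed sentence boundary: whitespace after '.', '!' or '?'.
        for sent in re.split(r"(?<=[.!?])\s+", para):
            if not sent:
                continue
            if fits(sent, max_tokens, max_chars):
                chunks.append(sent)
            else:
                chunks.extend(force_split(sent, max_tokens, max_chars))
    return chunks


sample = "Short paragraph.\n\nFirst sentence here. Second sentence here.\n\n" + "x" * 50
for piece in split_cascade(sample, max_tokens=4, max_chars=20):
    print(repr(piece))
```

With the deliberately tiny caps used here, the first paragraph survives whole, the second is broken at sentence level, and the 50-character run without whitespace is cut purely by character count.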

Token counting uses whitespace splitting: len(text.split()).
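The snippet below illustrates why a whitespace token count alone is not a sufficient cap. It is a small sketch, not code from the module: `token_count` is a hypothetical wrapper around the documented `len(text.split())`, and the dense run is invented sample data.

```python
def token_count(text: str) -> int:
    """Whitespace token count, as described above: len(text.split())."""
    return len(text.split())


MAX_TOKENS = 512
MAX_CHARS = MAX_TOKENS * 4  # documented default for max_chars

# Hypothetical dense run: 400 words glued with zero-width joiners (U+200D),
# which str.split() does not treat as whitespace.
dense = "\u200d".join(["promo"] * 400)

print(token_count("three plain words"))  # 3
print(token_count(dense), len(dense))    # 1 2399
print(token_count(dense) <= MAX_TOKENS)  # True: the token cap alone would pass it
print(len(dense) <= MAX_CHARS)           # False: the character cap catches it
```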

Classes

ChunkResult `dataclass`

ChunkResult(content: str, index: int = 0, metadata: Dict[str, Any] = dict())

A single chunk produced by SemanticChunker.chunk().

SemanticChunker

SemanticChunker(max_tokens: int = 512, *, max_chars: Optional[int] = None, overlap_tokens: Optional[int] = None)

Split text by document type with a hard size cap and overlap.

PARAMETER	DESCRIPTION
`max_tokens`	Hard upper bound on chunk size in whitespace-delimited tokens. No emitted chunk exceeds this, except `event` and `contact` chunks, which are never split or capped. TYPE: `int` DEFAULT: `512`
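`max_chars`	Hard upper bound on chunk content in characters. Defaults to `max_tokens * 4` (a typical English chars-per-token estimate). The chunker force-splits any run that exceeds this, even when its token count alone would fit. The overlap prefix is not charged against this budget, so a final chunk can exceed `max_chars` by roughly `overlap_tokens * 4` characters. `event` and `contact` chunks are not capped. TYPE: `Optional[int]` DEFAULT: `None`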
`overlap_tokens`	Number of trailing tokens copied from each chunk into the head of the next, so downstream retrieval doesn't miss context that straddles a chunk boundary. Defaults to `min(100, max(1, max_tokens // 5))` and is clamped to `[0, max_tokens - 1]`. Set to `0` to disable. TYPE: `Optional[int]` DEFAULT: `None`

Source code in src/openjarvis/connectors/chunker.py

def __init__(
    self,
    max_tokens: int = 512,
    *,
    max_chars: Optional[int] = None,
    overlap_tokens: Optional[int] = None,
) -> None:
    self.max_tokens = max_tokens
    self.max_chars = max_chars if max_chars is not None else max_tokens * 4
    if overlap_tokens is None:
        overlap_tokens = min(100, max(1, max_tokens // 5))
    self.overlap_tokens = max(0, min(overlap_tokens, max(0, max_tokens - 1)))

    # Content budget per chunk leaves room for the overlap prefix
    # that gets added in the final pass, so tokens stay within
    # max_tokens. Char budget intentionally does NOT subtract overlap
    # — we accept up to ~overlap_tokens*4 chars of headroom over
    # max_chars after overlap, which still sits under any reasonable
    # hard ceiling and avoids spurious force-splits of well-behaved
    # sentences in configurations where the char cap is tight.
    self._content_tokens = max(1, self.max_tokens - self.overlap_tokens)
    self._content_chars = self.max_chars
    # Soft target used to decide when a paragraph is "long enough to
    # sub-split", and to size sentence-accumulated sub-chunks. At
    # roughly half the hard cap, typical paragraphs ship as a single
    # chunk and only the unusually long ones get further split,
    # which keeps semantic units intact while still pulling the
    # chunk-length tail down.
    self._target_tokens = max(1, self._content_tokens // 2)
    self._target_chars = max(1, self._content_chars // 2)
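To make the derived budgets concrete, the sketch below repeats the constructor's arithmetic in a standalone function. `derive_budgets` is a hypothetical helper written for this illustration; only the formulas are taken from the listing above.

```python
from typing import Dict, Optional


def derive_budgets(
    max_tokens: int = 512,
    max_chars: Optional[int] = None,
    overlap_tokens: Optional[int] = None,
) -> Dict[str, int]:
    """Mirror the arithmetic in __init__ and return the derived values."""
    if max_chars is None:
        max_chars = max_tokens * 4
    if overlap_tokens is None:
        overlap_tokens = min(100, max(1, max_tokens // 5))
    # Clamp the overlap to [0, max_tokens - 1].
    overlap_tokens = max(0, min(overlap_tokens, max(0, max_tokens - 1)))
    # Token budget leaves room for the overlap prefix; char budget does not.
    content_tokens = max(1, max_tokens - overlap_tokens)
    content_chars = max_chars
    return {
        "max_chars": max_chars,
        "overlap_tokens": overlap_tokens,
        "content_tokens": content_tokens,
        "content_chars": content_chars,
        "target_tokens": max(1, content_tokens // 2),
        "target_chars": max(1, content_chars // 2),
    }


defaults = derive_budgets()
print(defaults["overlap_tokens"], defaults["content_tokens"])  # 100 412
print(defaults["target_tokens"], defaults["target_chars"])     # 206 1024

# Edge case: with max_tokens=1 the default overlap of 1 is clamped to 0.
print(derive_budgets(max_tokens=1)["overlap_tokens"])          # 0
```

With the defaults, each chunk carries up to 412 tokens of new content plus up to 100 tokens of overlap, and paragraphs longer than about 206 tokens or 1024 characters become candidates for sub-splitting.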

Functions

chunk

chunk(text: str, *, doc_type: str = 'document', metadata: Optional[Dict[str, Any]] = None) -> List[ChunkResult]

Split text into ChunkResult objects.

Returns an empty list if text is empty or whitespace-only. Events and contacts are always returned as a single chunk regardless of size; all other types respect the size caps and receive overlap between consecutive chunks.
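The overlap between consecutive chunks can be pictured with the sketch below. It is an assumption-laden illustration rather than the module's code: `add_overlap` is a hypothetical helper, and joining the tail to the next chunk with a single space is a simplification chosen for this example.

```python
from typing import List


def add_overlap(chunks: List[str], overlap_tokens: int) -> List[str]:
    """Prefix each chunk after the first with the token tail of its predecessor."""
    if overlap_tokens <= 0:
        return list(chunks)  # overlap disabled
    out: List[str] = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
            continue
        # Take the tail from the original previous chunk, not the overlapped one,
        # so the prefix never grows beyond overlap_tokens tokens.
        tail = chunks[i - 1].split()[-overlap_tokens:]
        out.append(" ".join(tail) + " " + chunk)
    return out


print(add_overlap(["a b c d", "e f", "g"], overlap_tokens=2))
# ['a b c d', 'c d e f', 'e f g']
```

Because the content budget is `max_tokens - overlap_tokens`, adding a prefix of at most `overlap_tokens` tokens keeps every chunk within the token cap.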