Skip to content

chunker

chunker

Type-aware semantic chunker for Deep Research ingestion.

Splits text by paragraph → sentence → token/character boundaries while enforcing a hard size cap and adding a fixed-size overlap between consecutive chunks. The cap is enforced on BOTH token count (whitespace-split) AND character count, since marketing-style emails frequently contain dense runs without whitespace (zero-width joiners, HTML residue) that defeat token-based limits alone.

Splitting strategy by doc_type
  • event, contact : Always a single chunk; never split, never capped.
  • email : Split on reply boundaries (On … wrote:), then paragraphs, then sentences, then force-split.
  • message : Split on double-newline boundaries, then sentences, then force-split.
  • document, note, anything else : Split on ## Heading section boundaries → paragraph boundaries → sentences → force-split.

Token counting uses whitespace splitting: len(text.split()).

Classes

ChunkResult dataclass

ChunkResult(content: str, index: int = 0, metadata: Dict[str, Any] = dict())

A single chunk produced by SemanticChunker.chunk().

SemanticChunker

SemanticChunker(max_tokens: int = 512, *, max_chars: Optional[int] = None, overlap_tokens: Optional[int] = None)

Split text by document type with a hard size cap and overlap.

PARAMETER DESCRIPTION
max_tokens

Hard upper bound on chunk size in whitespace-delimited tokens. No emitted chunk exceeds this.

TYPE: int DEFAULT: 512

max_chars

Hard upper bound on chunk size in characters. Defaults to max_tokens * 4 (a typical English chars-per-token estimate). No emitted chunk exceeds this — the chunker force-splits any run that does, even when token count alone would say it fits.

TYPE: Optional[int] DEFAULT: None

overlap_tokens

Token tail copied from each chunk into the head of the next so downstream retrieval doesn't miss context that straddles a chunk boundary. Defaults to min(100, max_tokens // 5) and is clamped to [0, max_tokens - 1]. Set to 0 to disable.

TYPE: Optional[int] DEFAULT: None

Source code in src/openjarvis/connectors/chunker.py
def __init__(
    self,
    max_tokens: int = 512,
    *,
    max_chars: Optional[int] = None,
    overlap_tokens: Optional[int] = None,
) -> None:
    self.max_tokens = max_tokens
    self.max_chars = max_chars if max_chars is not None else max_tokens * 4
    if overlap_tokens is None:
        overlap_tokens = min(100, max(1, max_tokens // 5))
    self.overlap_tokens = max(0, min(overlap_tokens, max(0, max_tokens - 1)))

    # Content budget per chunk leaves room for the overlap prefix
    # that gets added in the final pass, so tokens stay within
    # max_tokens. Char budget intentionally does NOT subtract overlap
    # — we accept up to ~overlap_tokens*4 chars of headroom over
    # max_chars after overlap, which still sits under any reasonable
    # hard ceiling and avoids spurious force-splits of well-behaved
    # sentences in configurations where the char cap is tight.
    self._content_tokens = max(1, self.max_tokens - self.overlap_tokens)
    self._content_chars = self.max_chars
    # Soft target used to decide when a paragraph is "long enough to
    # sub-split", and to size sentence-accumulated sub-chunks. At
    # roughly half the hard cap, typical paragraphs ship as a single
    # chunk and only the unusually long ones get further split,
    # which keeps semantic units intact while still pulling the
    # chunk-length tail down.
    self._target_tokens = max(1, self._content_tokens // 2)
    self._target_chars = max(1, self._content_chars // 2)
Functions
chunk
chunk(text: str, *, doc_type: str = 'document', metadata: Optional[Dict[str, Any]] = None) -> List[ChunkResult]

Split text into ChunkResult objects.

Returns an empty list if text is empty or whitespace-only. Events and contacts are always returned as a single chunk regardless of size; all other types respect the size caps and receive overlap between consecutive chunks.

Source code in src/openjarvis/connectors/chunker.py
def chunk(
    self,
    text: str,
    *,
    doc_type: str = "document",
    metadata: Optional[Dict[str, Any]] = None,
) -> List[ChunkResult]:
    """Split *text* into ``ChunkResult`` objects.

    Returns an empty list if *text* is empty or whitespace-only.
    Events and contacts are always returned as a single chunk
    regardless of size; all other types respect the size caps and
    receive overlap between consecutive chunks.
    """
    if not text or not text.strip():
        return []

    parent_meta: Dict[str, Any] = dict(metadata or {})

    if doc_type in ("event", "contact"):
        return [ChunkResult(content=text, index=0, metadata=parent_meta)]

    if doc_type == "email":
        raw_chunks = self._chunk_email(text)
    elif doc_type == "message":
        raw_chunks = self._chunk_message(text)
    else:
        raw_chunks = self._chunk_document(text)

    # Apply overlap globally between consecutive chunks.
    contents = [c for c, _ in raw_chunks]
    metas = [m for _, m in raw_chunks]
    overlapped = _apply_overlap(contents, overlap_tokens=self.overlap_tokens)

    results: List[ChunkResult] = []
    for idx, (content, extra_meta) in enumerate(zip(overlapped, metas)):
        merged: Dict[str, Any] = dict(parent_meta)
        merged.update(extra_meta)
        results.append(ChunkResult(content=content, index=idx, metadata=merged))
    return results