
chunker

Type-aware semantic chunker for Deep Research ingestion.

Splits text based on document type, never splitting mid-sentence. Returns ChunkResult dataclass objects with section metadata and inherited parent metadata.

Splitting strategy by doc_type
  • event, contact : Always a single chunk; never split.
  • email : Split on reply boundaries (On … wrote:), then sentence-split within each part.
  • message : Split on double-newline boundaries, accumulate into chunks up to max_tokens.
  • document, note, anything else : Split on ## Heading section boundaries → paragraph boundaries (\n\n) within sections → sentence boundaries as a last resort.
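The email reply-boundary split described above can be sketched with a zero-width regex split. This is a hypothetical reconstruction, not the actual pattern in `chunker.py`; it assumes a typical `On <date>, <sender> wrote:` marker on its own line.

```python
import re

# Hypothetical reply-boundary pattern; the real one in chunker.py may differ.
# The lookahead keeps the "On ... wrote:" marker attached to the reply it opens.
REPLY_RE = re.compile(r"(?=^On .+ wrote:$)", re.MULTILINE)

def split_replies(text: str) -> list[str]:
    """Split an email body at quoted-reply boundaries."""
    return [part.strip() for part in REPLY_RE.split(text) if part.strip()]

email = (
    "Thanks, that works for me.\n\n"
    "On Mon, Jan 6, 2025, Alice wrote:\n"
    "> Can we move the meeting to 3pm?"
)
parts = split_replies(email)
# parts[0] is the top reply; parts[1] is the quoted thread below the marker.
```

Each part would then be sentence-split and accumulated up to `max_tokens`, per the strategy table.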

Token counting uses whitespace splitting: len(text.split()).
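Concretely, a "token" here is a whitespace-delimited word, not a model tokenizer unit:

```python
# Token counting as used by the chunker: plain whitespace splitting.
# Punctuation stays attached to words, so counts are approximate relative
# to any real model tokenizer.
def count_tokens(text: str) -> int:
    return len(text.split())

count_tokens("Split text based on document type.")  # 6 tokens
```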

Classes

ChunkResult dataclass

ChunkResult(content: str, index: int = 0, metadata: Dict[str, Any] = dict())

A single chunk produced by SemanticChunker.chunk().

SemanticChunker

SemanticChunker(max_tokens: int = 512)

Split text based on document type without breaking mid-sentence.

PARAMETER DESCRIPTION
max_tokens

Soft upper limit on chunk size measured in whitespace-delimited tokens (i.e. len(text.split())). Single unsplittable segments may exceed this limit.

TYPE: int DEFAULT: 512

Source code in src/openjarvis/connectors/chunker.py
def __init__(self, max_tokens: int = 512) -> None:
    self.max_tokens = max_tokens
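The "soft limit" behavior of `max_tokens` can be illustrated with a greedy accumulation sketch. This `accumulate` helper is hypothetical (the internal helpers are not shown here); it assumes segments are joined with blank lines, as in the message strategy.

```python
# Hypothetical sketch of greedy accumulation under a soft max_tokens limit:
# segments are appended until adding the next would exceed the budget.
# A single oversized segment still becomes its own chunk, which is why
# the limit is "soft".
def accumulate(segments: list[str], max_tokens: int) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for seg in segments:
        seg_tokens = len(seg.split())
        if current and current_tokens + seg_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(seg)
        current_tokens += seg_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

accumulate(["one two", "three four five", "six"], max_tokens=4)
# → ["one two", "three four five\n\nsix"]
```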
Functions
chunk
chunk(text: str, *, doc_type: str = 'document', metadata: Optional[Dict[str, Any]] = None) -> List[ChunkResult]

Split text into ChunkResult objects.

PARAMETER DESCRIPTION
text

The raw text to split.

TYPE: str

doc_type

Controls the splitting strategy (see "Splitting strategy by doc_type" above).

TYPE: str DEFAULT: 'document'

metadata

Parent metadata dict; copied into every chunk's metadata.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

RETURNS DESCRIPTION
A list of ChunkResult objects with sequential 0-based index values. Returns an empty list if text is empty or whitespace-only.

TYPE: List[ChunkResult]
Source code in src/openjarvis/connectors/chunker.py
def chunk(
    self,
    text: str,
    *,
    doc_type: str = "document",
    metadata: Optional[Dict[str, Any]] = None,
) -> List[ChunkResult]:
    """Split *text* into ``ChunkResult`` objects.

    Parameters
    ----------
    text:      The raw text to split.
    doc_type:  Controls the splitting strategy (see module docstring).
    metadata:  Parent metadata dict; copied into every chunk's ``metadata``.

    Returns
    -------
    A list of ``ChunkResult`` objects with sequential 0-based ``index``
    values.  Returns an empty list if *text* is empty or whitespace-only.
    """
    if not text or not text.strip():
        return []

    parent_meta: Dict[str, Any] = dict(metadata or {})

    if doc_type in ("event", "contact"):
        raw_chunks = self._chunk_atomic(text)
    elif doc_type == "email":
        raw_chunks = self._chunk_email(text)
    elif doc_type == "message":
        raw_chunks = self._chunk_message(text)
    else:
        # "document", "note", or any unknown type
        raw_chunks = self._chunk_document(text)

    results: List[ChunkResult] = []
    for idx, (content, extra_meta) in enumerate(raw_chunks):
        merged: Dict[str, Any] = dict(parent_meta)
        merged.update(extra_meta)
        results.append(ChunkResult(content=content, index=idx, metadata=merged))

    return results
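The metadata merge at the end of chunk() copies the parent dict first, then layers chunk-specific keys on top, so a chunk-level key wins on collision. A minimal illustration (the key names here are made up):

```python
# Mirrors the merge loop in chunk(): parent metadata is copied, then
# per-chunk extra metadata is applied over it. On a key collision the
# chunk-level value wins; the parent dict itself is never mutated.
parent_meta = {"source": "inbox", "section": None}
extra_meta = {"section": "Agenda"}

merged = dict(parent_meta)
merged.update(extra_meta)
# merged == {"source": "inbox", "section": "Agenda"}
```

Because `dict(parent_meta)` makes a fresh copy per chunk, mutating one chunk's metadata does not leak into its siblings or into the caller's dict.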