
chunker

Type-aware semantic chunker for Deep Research ingestion.

Splits text based on document type, never splitting mid-sentence. Returns ChunkResult dataclass objects with section metadata and inherited parent metadata.

Splitting strategy by doc_type
  • event, contact : Always a single chunk; never split.
  • email : Split on reply boundaries (On … wrote:), then sentence-split within each part.
  • message : Split on double-newline boundaries, accumulate into chunks up to max_tokens.
  • document, note, anything else : Split on ## Heading section boundaries → paragraph boundaries (\n\n) within sections → sentence boundaries as a last resort.
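The email reply-boundary split described above can be sketched with a zero-width regex split. This is a hypothetical reconstruction, not the actual pattern in `chunker.py`; it assumes a typical `On <date>, <sender> wrote:` marker on its own line.

```python
import re

# Hypothetical reply-boundary pattern; the real one in chunker.py may differ.
# The lookahead keeps the "On ... wrote:" marker attached to the reply it opens.
REPLY_RE = re.compile(r"(?=^On .+ wrote:$)", re.MULTILINE)

def split_replies(text: str) -> list[str]:
    """Split an email body at quoted-reply boundaries."""
    return [part.strip() for part in REPLY_RE.split(text) if part.strip()]

email = (
    "Thanks, that works for me.\n\n"
    "On Mon, Jan 6, 2025, Alice wrote:\n"
    "> Can we move the meeting to 3pm?"
)
parts = split_replies(email)
# parts[0] is the top reply; parts[1] is the quoted thread below the marker.
```

Each part would then be sentence-split and accumulated up to `max_tokens`, per the strategy table.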

Token counting uses whitespace splitting: len(text.split()).
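Concretely, a "token" here is a whitespace-delimited word, not a model tokenizer unit:

```python
# Token counting as used by the chunker: plain whitespace splitting.
# Punctuation stays attached to words, so counts are approximate relative
# to any real model tokenizer.
def count_tokens(text: str) -> int:
    return len(text.split())

count_tokens("Split text based on document type.")  # 6 tokens
```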

Classes

ChunkResult dataclass

ChunkResult(content: str, index: int = 0, metadata: Dict[str, Any] = dict())

A single chunk produced by SemanticChunker.chunk().

SemanticChunker

SemanticChunker(max_tokens: int = 512)

Split text based on document type without breaking mid-sentence.

PARAMETER DESCRIPTION
max_tokens

Soft upper limit on chunk size measured in whitespace-delimited tokens (i.e. len(text.split())). Single unsplittable segments may exceed this limit.

TYPE: int DEFAULT: 512

Source code in src/openjarvis/connectors/chunker.py
def __init__(self, max_tokens: int = 512) -> None:
    self.max_tokens = max_tokens
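The "soft limit" behavior of `max_tokens` can be illustrated with a greedy accumulation sketch. This `accumulate` helper is hypothetical (the internal helpers are not shown here); it assumes segments are joined with blank lines, as in the message strategy.

```python
# Hypothetical sketch of greedy accumulation under a soft max_tokens limit:
# segments are appended until adding the next would exceed the budget.
# A single oversized segment still becomes its own chunk, which is why
# the limit is "soft".
def accumulate(segments: list[str], max_tokens: int) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for seg in segments:
        seg_tokens = len(seg.split())
        if current and current_tokens + seg_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(seg)
        current_tokens += seg_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

accumulate(["one two", "three four five", "six"], max_tokens=4)
# → ["one two", "three four five\n\nsix"]
```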
Functions
chunk
chunk(text: str, *, doc_type: str = 'document', metadata: Optional[Dict[str, Any]] = None) -> List[ChunkResult]

Split text into ChunkResult objects.

PARAMETER DESCRIPTION
text

The raw text to split.

TYPE: str

doc_type

Controls the splitting strategy (see "Splitting strategy by doc_type" above).

TYPE: str DEFAULT: 'document'

metadata

Parent metadata dict; copied into every chunk's metadata.

TYPE: Optional[Dict[str, Any]] DEFAULT: None

RETURNS DESCRIPTION
A list of ChunkResult objects with sequential 0-based index values. Returns an empty list if text is empty or whitespace-only.

TYPE: List[ChunkResult]
Source code in src/openjarvis/connectors/chunker.py
def chunk(
    self,
    text: str,
    *,
    doc_type: str = "document",
    metadata: Optional[Dict[str, Any]] = None,
) -> List[ChunkResult]:
    """Split *text* into ``ChunkResult`` objects.

    Parameters
    ----------
    text:      The raw text to split.
    doc_type:  Controls the splitting strategy (see module docstring).
    metadata:  Parent metadata dict; copied into every chunk's ``metadata``.

    Returns
    -------
    A list of ``ChunkResult`` objects with sequential 0-based ``index``
    values.  Returns an empty list if *text* is empty or whitespace-only.
    """
    if not text or not text.strip():
        return []

    parent_meta: Dict[str, Any] = dict(metadata or {})

    if doc_type in ("event", "contact"):
        raw_chunks = self._chunk_atomic(text)
    elif doc_type == "email":
        raw_chunks = self._chunk_email(text)
    elif doc_type == "message":
        raw_chunks = self._chunk_message(text)
    else:
        # "document", "note", or any unknown type
        raw_chunks = self._chunk_document(text)

    results: List[ChunkResult] = []
    for idx, (content, extra_meta) in enumerate(raw_chunks):
        merged: Dict[str, Any] = dict(parent_meta)
        merged.update(extra_meta)
        results.append(ChunkResult(content=content, index=idx, metadata=merged))

    return results
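The metadata merge at the end of chunk() copies the parent dict first, then layers chunk-specific keys on top, so a chunk-level key wins on collision. A minimal illustration (the key names here are made up):

```python
# Mirrors the merge loop in chunk(): parent metadata is copied, then
# per-chunk extra metadata is applied over it. On a key collision the
# chunk-level value wins; the parent dict itself is never mutated.
parent_meta = {"source": "inbox", "section": None}
extra_meta = {"section": "Agenda"}

merged = dict(parent_meta)
merged.update(extra_meta)
# merged == {"source": "inbox", "section": "Agenda"}
```

Because `dict(parent_meta)` makes a fresh copy per chunk, mutating one chunk's metadata does not leak into its siblings or into the caller's dict.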