chunker
Type-aware semantic chunker for Deep Research ingestion.
Splits text based on document type, never splitting mid-sentence.
Returns ChunkResult dataclass objects with section metadata and
inherited parent metadata.
Splitting strategy by `doc_type`:

- `event`, `contact`: always a single chunk; never split.
- `email`: split on reply boundaries (`On … wrote:`), then sentence-split within each part.
- `message`: split on double-newline boundaries, accumulating parts into chunks of up to `max_tokens`.
- `document`, `note`, and anything else: split on `## Heading` section boundaries → paragraph boundaries (`\n\n`) within sections → sentence boundaries as a last resort.
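The fallback chain for the `document` strategy can be sketched as follows. This is a minimal illustration, not the library's implementation: the `Chunk` stand-in and the `split_document` helper are hypothetical, and the real chunker also falls back to sentence boundaries, which is omitted here for brevity.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical stand-in for the library's ChunkResult; field names
# are assumptions based on this page, not the real definition.
@dataclass
class Chunk:
    text: str
    index: int

def count_tokens(text: str) -> int:
    # Whitespace token counting, as documented on this page.
    return len(text.split())

def split_document(text: str, max_tokens: int = 50) -> List[Chunk]:
    """Sketch of the 'document' strategy: sections, then paragraphs."""
    # 1. Split on "## Heading" section boundaries (zero-width lookahead
    #    keeps the heading attached to its section).
    sections = re.split(r"(?m)^(?=## )", text)
    chunks: List[Chunk] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if count_tokens(section) <= max_tokens:
            chunks.append(Chunk(section, len(chunks)))
            continue
        # 2. Oversized section: fall back to paragraph boundaries.
        for para in section.split("\n\n"):
            para = para.strip()
            if para:
                chunks.append(Chunk(para, len(chunks)))
    return chunks
```

Chunks receive sequential 0-based indices, matching the contract documented for `chunk()` below.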
Token counting uses whitespace splitting: `len(text.split())`.
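For example, the documented token count is reproducible in one line (the `count_tokens` name is illustrative, not part of the public API):

```python
# Plain whitespace splitting: punctuation stays attached to the
# adjacent word, and no tokenizer library is involved.
def count_tokens(text: str) -> int:
    return len(text.split())

print(count_tokens("One two\n three.  four"))  # → 4
```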
Classes

ChunkResult (`dataclass`)
A single chunk produced by SemanticChunker.chunk().
SemanticChunker
Split text based on document type without breaking mid-sentence.
| PARAMETER | DESCRIPTION |
|---|---|
| `max_tokens` | Soft upper limit on chunk size, measured in whitespace-delimited tokens (i.e. `len(text.split())`). TYPE: `int` |
Source code in src/openjarvis/connectors/chunker.py
Functions

chunk
chunk(text: str, *, doc_type: str = 'document', metadata: Optional[Dict[str, Any]] = None) -> List[ChunkResult]
Split text into ChunkResult objects.
| PARAMETER | DESCRIPTION |
|---|---|
| `text` | The text to split. TYPE: `str` |
| `doc_type` | Document type that selects the splitting strategy. TYPE: `str`, DEFAULT: `'document'` |
| `metadata` | Parent metadata inherited by every chunk. TYPE: `Optional[Dict[str, Any]]`, DEFAULT: `None` |
| RETURNS | DESCRIPTION |
|---|---|
| `List[ChunkResult]` | A list of `ChunkResult` objects with sequential 0-based `index` values. Returns an empty list if *text* is empty or whitespace-only. |
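The accumulation behaviour and the empty-input contract can be illustrated with a minimal sketch of the `message` strategy. This is not the library API: `chunk_message` is a hypothetical helper that returns plain strings rather than `ChunkResult` objects.

```python
import re
from typing import List

def chunk_message(text: str, max_tokens: int = 8) -> List[str]:
    """Sketch of the 'message' strategy: split on blank lines and
    accumulate paragraphs until adding one would exceed max_tokens."""
    if not text.strip():
        return []  # empty/whitespace-only input yields an empty list
    parts = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    tokens = 0
    for part in parts:
        n = len(part.split())  # whitespace token count, as documented
        if current and tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, tokens = [], 0
        current.append(part)
        tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Note that `max_tokens` is a soft limit: a single paragraph longer than the limit still becomes its own chunk rather than being cut mid-sentence.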