chunking
Document chunking with configurable size and overlap.
Splits text into fixed-size chunks (measured in whitespace-split tokens) with a configurable overlap. Paragraph boundaries are respected when they fall within the chunk window.
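The windowing idea can be sketched as follows. This is a minimal illustration, not the module's implementation: `sliding_chunks`, `chunk_size`, and `overlap` are hypothetical names chosen for the example, and paragraph-boundary handling is deliberately left out.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 5, overlap: int = 2) -> List[str]:
    """Split text into windows of whitespace-split tokens with overlap.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()        # whitespace-split tokens
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


demo = sliding_chunks("a b c d e f g h", chunk_size=4, overlap=1)
print(demo)  # ['a b c d', 'd e f g', 'g h']
```

Each window repeats the last `overlap` tokens of the previous one, so context that straddles a cut is still present in at least one chunk.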
Classes
ChunkConfig (dataclass)
Parameters controlling the chunking strategy.
Chunk (dataclass)
Chunk(content: str, source: str = '', offset: int = 0, index: int = 0, metadata: Dict[str, Any] = dict())
A single chunk produced by the chunking pipeline.
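For orientation, a dataclass with this signature could look like the sketch below. It is reconstructed from the signature alone: the `dict()` default shown for `metadata` suggests `field(default_factory=dict)`, but that is an assumption, and the actual definition may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Chunk:
    # Field names, types, and defaults are copied from the documented signature.
    content: str
    source: str = ""
    offset: int = 0
    index: int = 0
    # Assumption: a default_factory gives every instance its own dict.
    metadata: Dict[str, Any] = field(default_factory=dict)


a = Chunk(content="hello world", source="notes.txt")
b = Chunk(content="second chunk")
a.metadata["lang"] = "en"  # does not leak into b.metadata
```

With a default factory, mutating one chunk's `metadata` leaves other chunks untouched, which a shared mutable default would not guarantee.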
Functions
chunk_text
chunk_text(text: str, *, source: str = '', config: Optional[ChunkConfig] = None) -> List[Chunk]
Split text into chunks respecting paragraph boundaries.
| PARAMETER | DESCRIPTION |
|---|---|
| `text` | The full document text. TYPE: `str` |
| `source` | Originating filename or identifier. TYPE: `str` DEFAULT: `''` |
| `config` | Chunking parameters (defaults are used if omitted). TYPE: `Optional[ChunkConfig]` DEFAULT: `None` |

| RETURNS | DESCRIPTION |
|---|---|
| `List[Chunk]` | List of `Chunk` objects, in order. |
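One simple way to keep paragraphs intact is to pack whole paragraphs greedily under a token budget. The sketch below shows that simplified technique only; it is not the logic of `chunk_text`, it ignores overlap, and `pack_paragraphs` and `max_tokens` are hypothetical names.

```python
import re
from typing import List


def pack_paragraphs(text: str, max_tokens: int = 8) -> List[str]:
    """Greedily group whole paragraphs into chunks of at most max_tokens tokens.

    Hypothetical helper; a paragraph longer than max_tokens becomes its own
    oversized chunk here instead of being split.
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for para in paragraphs:
        n = len(para.split())  # whitespace-split token count
        if current and current_len + n > max_tokens:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "one two three\n\nfour five\n\nsix seven eight nine ten"
parts = pack_paragraphs(doc, max_tokens=6)
print(parts)  # ['one two three\n\nfour five', 'six seven eight nine ten']
```

The first two paragraphs fit the six-token budget together; the third would overflow it, so the cut lands on the paragraph boundary rather than mid-paragraph.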
Source code in src/openjarvis/tools/storage/chunking.py