Chunking
ChunkingExtractor
Extractor that chunks the extraction and aggregates the results per key. This extractor splits the document into chunks and calls the base extraction function once per chunk, passing some context from the preceding chunks to each subsequent call.
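The per-chunk calling pattern described above can be sketched in plain Python. This is an illustrative stand-in, not the library's implementation: `chunked_extract`, `extract_fn`, and the fixed-size splitting are all assumptions made for the sketch.

```python
# Hedged sketch of chunked extraction with per-key aggregation.
# `extract_fn` stands in for the base extraction function; the real
# extractor uses a tokenizer and max_char_buffer rather than a fixed size.
def chunked_extract(text, extract_fn, chunk_size=20):
    results = {}   # aggregated output, keyed by field name
    context = ""   # context carried over from the previous chunk
    for start in range(0, len(text), chunk_size):
        chunk = text[start:start + chunk_size]
        out = extract_fn(chunk, context=context)  # one call per chunk
        for key, value in out.items():
            results.setdefault(key, []).append(value)
        context = chunk  # pass the previous chunk as context
    return results
```

An `Aggregator` would then reduce each key's list of values into a single result.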
Pass `llm=None` together with `verbose=True` to log the number of chunks per document without running inference.
WARNING: If a token longer than `max_char_buffer` is encountered, it becomes its own chunk. This edge case can produce chunks larger than `max_char_buffer` would normally allow.
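A minimal sketch of the size-limited chunking behind this warning (illustrative only; `chunk_tokens` is not the library's API): tokens are packed into chunks of at most `max_char_buffer` characters, and a single oversized token becomes its own chunk.

```python
# Hedged sketch of token packing under a character budget.
# A token longer than max_char_buffer still becomes its own chunk,
# so that chunk exceeds the buffer limit (the documented edge case).
def chunk_tokens(tokens, max_char_buffer):
    chunks, current = [], ""
    for tok in tokens:
        if len(current) + len(tok) <= max_char_buffer:
            current += tok
        else:
            if current:
                chunks.append(current)
            current = tok  # may itself exceed max_char_buffer
    if current:
        chunks.append(current)
    return chunks
```

With `max_char_buffer=4`, the 8-character token below ends up as an oversized chunk of its own.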
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `aggregator` | `Aggregator` | Method to aggregate the LLM output for the individual chunks before returning. | *required* |
| `return_as_list` | `list[str] \| None` | List of field names to return as lists of all extracted values. | `None` |
| `tokenizer` | `Tokenizer \| None` | Tokenizer to use for chunking. | `None` |
| `max_char_buffer` | `int` | Maximum chunk size in characters. | `20000` |
| `verbose` | `bool` | Adds verbose logging. | `False` |
| `**kwargs` | | Additional keyword arguments passed to the base extraction function. | `{}` |
Source code in `src/kibad_llm/extractors/chunking.py`