Core

Library for breaking documents into chunks of sentences.

When a text-to-text model (e.g. a large language model with a fixed context size) cannot accommodate a large document, this library helps break the document into chunks of a required maximum length that we can run inference on.
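
For orientation, here is a minimal usage sketch. The import path is inferred from the source location shown below, and instantiating `tokenizer_lib.Tokenizer()` directly is an assumption, not confirmed API:

```python
# Hypothetical usage sketch. The import path is inferred from
# src/kibad_llm/extractors/chunking_utils/core.py, and instantiating
# tokenizer_lib.Tokenizer() directly is an assumption.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

document = "Roses are red. Violets are blue. Flowers are nice. And so are you."
chunker = core.ChunkIterator(
    document=document,
    max_char_buffer=60,  # largest chunk, in characters, we can run inference on
    tokenizer_impl=tokenizer_lib.Tokenizer(),
)
for chunk in chunker:
    print(chunk.chunk_text)
```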

CharInterval dataclass

Class for representing a character interval.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `start_pos` | `int \| None` | The starting position of the interval (inclusive). |
| `end_pos` | `int \| None` | The ending position of the interval (exclusive). |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
@dataclasses.dataclass
class CharInterval:
    """Class for representing a character interval.

    Attributes:
      start_pos: The starting position of the interval (inclusive).
      end_pos: The ending position of the interval (exclusive).
    """

    start_pos: int | None = None
    end_pos: int | None = None
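
A CharInterval addresses a half-open [start_pos, end_pos) span of the original document string, so it can be used directly to slice the text:

```python
# Illustrative only: slice a document with a half-open character interval.
interval = CharInterval(start_pos=0, end_pos=5)
text = "Hello, world."
print(text[interval.start_pos : interval.end_pos])  # -> "Hello"
```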

ChunkIterator

Iterate through chunks of a tokenized text.

Chunks may consist of sentences or sentence fragments that can fit into the maximum character buffer that we can run inference on.

Chunk cases:

A) If a sentence length exceeds the max char buffer, then it needs to be broken into chunks that can fit within the max char buffer. We do this in a way that maximizes the chunk length while respecting newlines (if present) and token boundaries. Consider this sentence from a poem by John Donne:

No man is an island,
Entire of itself,
Every man is a piece of the continent,
A part of the main.

With max_char_buffer=40, the chunks are:

* "No man is an island,\nEntire of itself," len=38
* "Every man is a piece of the continent," len=38
* "A part of the main." len=19

B) If a single token exceeds the max char buffer, it comprises the whole chunk. Consider the sentence: "This is antidisestablishmentarianism." With max_char_buffer=20, the chunks are:

* "This is" len=7
* "antidisestablishmentarianism" len=28
* "." len=1

C) If multiple whole sentences can fit within the max char buffer, then they are used to form the chunk. Consider the sentences: "Roses are red. Violets are blue. Flowers are nice. And so are you." With max_char_buffer=60, the chunks are:

* "Roses are red. Violets are blue. Flowers are nice." len=50
* "And so are you." len=15
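
The cases above can be reproduced with a short loop. This sketch reuses the hypothetical import and tokenizer assumptions from the introduction; the expected output is taken from case A:

```python
# Sketch of case A; imports and Tokenizer() instantiation are the same
# assumptions as in the introductory example.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

poem = (
    "No man is an island,\n"
    "Entire of itself,\n"
    "Every man is a piece of the continent,\n"
    "A part of the main."
)
chunker = core.ChunkIterator(poem, max_char_buffer=40,
                             tokenizer_impl=tokenizer_lib.Tokenizer())
for chunk in chunker:
    print(repr(chunk.chunk_text), len(chunk.chunk_text))
# Expected, per case A above:
# 'No man is an island,\nEntire of itself,' 38
# 'Every man is a piece of the continent,' 38
# 'A part of the main.' 19
```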

Source code in src/kibad_llm/extractors/chunking_utils/core.py
class ChunkIterator:
    r"""Iterate through chunks of a tokenized text.

    Chunks may consist of sentences or sentence fragments that can fit into the
    maximum character buffer that we can run inference on.

    Chunk cases:

    A)
    If a sentence length exceeds the max char buffer, then it needs to be broken
    into chunks that can fit within the max char buffer. We do this in a way that
    maximizes the chunk length while respecting newlines (if present) and token
    boundaries.
    Consider this sentence from a poem by John Donne:
    ```
    No man is an island,
    Entire of itself,
    Every man is a piece of the continent,
    A part of the main.
    ```
    With max_char_buffer=40, the chunks are:
    * "No man is an island,\nEntire of itself," len=38
    * "Every man is a piece of the continent," len=38
    * "A part of the main." len=19

    B)
    If a single token exceeds the max char buffer, it comprises the whole chunk.
    Consider the sentence:
    "This is antidisestablishmentarianism."
    With max_char_buffer=20, the chunks are:
    * "This is" len=7
    * "antidisestablishmentarianism" len=28
    * "." len(1)

    C)
    If multiple *whole* sentences can fit within the max char buffer, then they
    are used to form the chunk.
    Consider the sentences:
    "Roses are red. Violets are blue. Flowers are nice. And so are you."
    With max_char_buffer=60, the chunks are:
    * "Roses are red. Violets are blue. Flowers are nice." len=50
    * "And so are you." len=15
    """

    def __init__(
        self,
        document: str,
        max_char_buffer: int,
        tokenizer_impl: tokenizer_lib.Tokenizer,
    ):
        """Constructor.

        Args:
            document: Document to chunk, as a string.
            max_char_buffer: Size of buffer that we can run inference on.
            tokenizer_impl: Tokenizer instance to use.
        """

        if isinstance(document, str):
            tokenized_text = tokenizer_impl.tokenize(document)
        else:
            raise ValueError("document has the wrong format. str expected")
        self.tokenized_text = tokenized_text
        self.max_char_buffer = max_char_buffer
        self.sentence_iter = SentenceIterator(self.tokenized_text)
        self.broken_sentence = False
        self.document = document

    def __iter__(self) -> Iterator[TextChunk]:
        return self

    def _tokens_exceed_buffer(self, token_interval: tokenizer_lib.TokenInterval) -> bool:
        """Check if the token interval exceeds the maximum buffer size.

        Args:
          token_interval: Token interval to check.

        Returns:
          True if the token interval exceeds the maximum buffer size.
        """
        char_interval = get_char_interval(self.tokenized_text, token_interval)
        if char_interval.start_pos is None or char_interval.end_pos is None:
            return False
        return (char_interval.end_pos - char_interval.start_pos) > self.max_char_buffer

    def __next__(self) -> TextChunk:
        sentence = next(self.sentence_iter)
        # If the next token is greater than the max_char_buffer, let it be the
        # entire chunk.
        curr_chunk = create_token_interval(sentence.start_index, sentence.start_index + 1)
        if self._tokens_exceed_buffer(curr_chunk):
            self.sentence_iter = SentenceIterator(
                self.tokenized_text, curr_token_pos=sentence.start_index + 1
            )
            self.broken_sentence = curr_chunk.end_index < sentence.end_index
            return TextChunk(
                token_interval=curr_chunk,
                document_text=self.document,
                tokenized_text=self.tokenized_text,
            )

        # Append tokens to the chunk up to the max_char_buffer.
        start_of_new_line = -1
        for token_index in range(curr_chunk.start_index, sentence.end_index):
            if self.tokenized_text.tokens[token_index].first_token_after_newline:
                start_of_new_line = token_index
            test_chunk = create_token_interval(curr_chunk.start_index, token_index + 1)
            if self._tokens_exceed_buffer(test_chunk):
                # Only break at a newline if one was seen and it falls strictly
                # after the chunk start, which keeps the interval non-empty.
                if start_of_new_line > 0 and start_of_new_line > curr_chunk.start_index:
                    # Terminate the curr_chunk at the start of the most recent newline.
                    curr_chunk = create_token_interval(curr_chunk.start_index, start_of_new_line)
                self.sentence_iter = SentenceIterator(
                    self.tokenized_text, curr_token_pos=curr_chunk.end_index
                )
                self.broken_sentence = True
                return TextChunk(
                    token_interval=curr_chunk,
                    document_text=self.document,
                    tokenized_text=self.tokenized_text,
                )
            else:
                curr_chunk = test_chunk

        if self.broken_sentence:
            self.broken_sentence = False
        else:
            for sentence in self.sentence_iter:
                test_chunk = create_token_interval(curr_chunk.start_index, sentence.end_index)
                if self._tokens_exceed_buffer(test_chunk):
                    self.sentence_iter = SentenceIterator(
                        self.tokenized_text,
                        curr_token_pos=curr_chunk.end_index,
                    )
                    return TextChunk(
                        token_interval=curr_chunk,
                        document_text=self.document,
                        tokenized_text=self.tokenized_text,
                    )
                else:
                    curr_chunk = test_chunk
        return TextChunk(
            token_interval=curr_chunk,
            document_text=self.document,
            tokenized_text=self.tokenized_text,
        )

__init__(document, max_char_buffer, tokenizer_impl)

Constructor.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `document` | `str` | Document to chunk, as a string. | required |
| `max_char_buffer` | `int` | Size of buffer that we can run inference on. | required |
| `tokenizer_impl` | `Tokenizer` | Tokenizer instance to use. | required |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def __init__(
    self,
    document: str,
    max_char_buffer: int,
    tokenizer_impl: tokenizer_lib.Tokenizer,
):
    """Constructor.

    Args:
        document: Document to chunk, as a string.
        max_char_buffer: Size of buffer that we can run inference on.
        tokenizer_impl: Tokenizer instance to use.
    """

    if isinstance(document, str):
        tokenized_text = tokenizer_impl.tokenize(document)
    else:
        raise ValueError("document has the wrong format. str expected")
    self.tokenized_text = tokenized_text
    self.max_char_buffer = max_char_buffer
    self.sentence_iter = SentenceIterator(self.tokenized_text)
    self.broken_sentence = False
    self.document = document

SentenceIterator

Iterate through sentences of a tokenized text.

Source code in src/kibad_llm/extractors/chunking_utils/core.py
class SentenceIterator:
    """Iterate through sentences of a tokenized text."""

    def __init__(
        self,
        tokenized_text: tokenizer_lib.TokenizedText,
        curr_token_pos: int = 0,
    ):
        """Constructor.

        Args:
          tokenized_text: Document to iterate through.
          curr_token_pos: Iterate through sentences from this token position.

        Raises:
          IndexError: if curr_token_pos is not within the document.
        """
        self.tokenized_text = tokenized_text
        self.token_len = len(tokenized_text.tokens)
        if curr_token_pos < 0:
            raise IndexError(f"Current token position {curr_token_pos} can not be negative.")
        elif curr_token_pos > self.token_len:
            raise IndexError(
                f"Current token position {curr_token_pos} is past the length of the "
                f"document {self.token_len}."
            )
        self.curr_token_pos = curr_token_pos

    def __iter__(self) -> Iterator[tokenizer_lib.TokenInterval]:
        return self

    def __next__(self) -> tokenizer_lib.TokenInterval:
        """Returns next sentence's interval starting from current token position.

        Returns:
          Next sentence token interval starting from current token position.

        Raises:
          StopIteration: If end of text is reached.
        """
        assert self.curr_token_pos <= self.token_len
        if self.curr_token_pos == self.token_len:
            raise StopIteration
        # This locates the sentence which contains the current token position.
        sentence_range = tokenizer_lib.find_sentence_range(
            self.tokenized_text.text,
            self.tokenized_text.tokens,
            self.curr_token_pos,
        )
        assert sentence_range
        # Start the sentence from the current token position.
        # If we are in the middle of a sentence, we should start from there.
        sentence_range = create_token_interval(self.curr_token_pos, sentence_range.end_index)
        self.curr_token_pos = sentence_range.end_index
        return sentence_range
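
A sketch of iterating sentence intervals directly (same import and tokenizer assumptions as the earlier examples); `get_token_interval_text` converts each interval back to text:

```python
# Sketch: walk the sentences of a tokenized text one interval at a time.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

tokenized = tokenizer_lib.Tokenizer().tokenize("Roses are red. Violets are blue.")
for interval in core.SentenceIterator(tokenized):
    # Each interval is a half-open [start_index, end_index) token span.
    print(core.get_token_interval_text(tokenized, interval))
# Exact boundaries depend on the tokenizer's sentence segmentation.
```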

__init__(tokenized_text, curr_token_pos=0)

Constructor.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenized_text` | `TokenizedText` | Document to iterate through. | required |
| `curr_token_pos` | `int` | Iterate through sentences from this token position. | `0` |

Raises:

| Type | Description |
| --- | --- |
| `IndexError` | If `curr_token_pos` is not within the document. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def __init__(
    self,
    tokenized_text: tokenizer_lib.TokenizedText,
    curr_token_pos: int = 0,
):
    """Constructor.

    Args:
      tokenized_text: Document to iterate through.
      curr_token_pos: Iterate through sentences from this token position.

    Raises:
      IndexError: if curr_token_pos is not within the document.
    """
    self.tokenized_text = tokenized_text
    self.token_len = len(tokenized_text.tokens)
    if curr_token_pos < 0:
        raise IndexError(f"Current token position {curr_token_pos} can not be negative.")
    elif curr_token_pos > self.token_len:
        raise IndexError(
            f"Current token position {curr_token_pos} is past the length of the "
            f"document {self.token_len}."
        )
    self.curr_token_pos = curr_token_pos

__next__()

Returns next sentence's interval starting from current token position.

Returns:

| Type | Description |
| --- | --- |
| `TokenInterval` | Next sentence token interval starting from current token position. |

Raises:

| Type | Description |
| --- | --- |
| `StopIteration` | If the end of the text is reached. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def __next__(self) -> tokenizer_lib.TokenInterval:
    """Returns next sentence's interval starting from current token position.

    Returns:
      Next sentence token interval starting from current token position.

    Raises:
      StopIteration: If end of text is reached.
    """
    assert self.curr_token_pos <= self.token_len
    if self.curr_token_pos == self.token_len:
        raise StopIteration
    # This locates the sentence which contains the current token position.
    sentence_range = tokenizer_lib.find_sentence_range(
        self.tokenized_text.text,
        self.tokenized_text.tokens,
        self.curr_token_pos,
    )
    assert sentence_range
    # Start the sentence from the current token position.
    # If we are in the middle of a sentence, we should start from there.
    sentence_range = create_token_interval(self.curr_token_pos, sentence_range.end_index)
    self.curr_token_pos = sentence_range.end_index
    return sentence_range

TextChunk dataclass

Stores a text chunk with attributes to the source document.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `token_interval` | `TokenInterval` | The token interval of the chunk in the source document. |
| `tokenized_text` | `TokenizedText` | The source document in its tokenized form as a TokenizedText object. |
| `document_text` | `str \| None` | The source document text. |

Properties:

* `get_tokenized_text`: TokenizedText of the current document.
* `chunk_text`: Text of the current chunk as a string.
* `sanitized_chunk_text`: Text of the current chunk as a sanitized string (all whitespace collapsed to a single space).
* `char_interval`: CharInterval of the chunk.
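
A sketch of reading these properties off a chunk produced by ChunkIterator (same assumptions as the earlier examples); note that `chunk_text` and `char_interval` require `document_text` to be set, which ChunkIterator does:

```python
# Sketch: inspect the first chunk of a document.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

document = "Roses are red.\n\tViolets are blue."
chunk = next(core.ChunkIterator(document, max_char_buffer=60,
                                tokenizer_impl=tokenizer_lib.Tokenizer()))
print(chunk.chunk_text)            # raw chunk text
print(chunk.sanitized_chunk_text)  # whitespace collapsed to single spaces
span = chunk.char_interval         # half-open [start_pos, end_pos) into document
print(document[span.start_pos : span.end_pos])
```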

Source code in src/kibad_llm/extractors/chunking_utils/core.py
@dataclasses.dataclass
class TextChunk:
    """Stores a text chunk with attributes to the source document.

    Attributes:
      token_interval: The token interval of the chunk in the source document.
      tokenized_text: The source document in its tokenized form as a
        TokenizedText object.
      document_text: The source document text.

    Properties:
      get_tokenized_text: TokenizedText of the current document.
      chunk_text: Text of the current chunk as a string.
      sanitized_chunk_text: Text of the current chunk as a sanitized string
        (all whitespace collapsed to a single space by `_sanitize`).
      char_interval: CharInterval of the chunk.

    """

    token_interval: tokenizer_lib.TokenInterval
    tokenized_text: tokenizer_lib.TokenizedText
    document_text: str | None = None
    _chunk_text: str | None = dataclasses.field(default=None, init=False, repr=False)
    _sanitized_chunk_text: str | None = dataclasses.field(default=None, init=False, repr=False)
    _char_interval: CharInterval | None = dataclasses.field(default=None, init=False, repr=False)

    def __str__(self):
        interval_repr = (
            f"start_index: {self.token_interval.start_index}, end_index:"
            f" {self.token_interval.end_index}"
        )

        try:
            chunk_text_repr = f"'{self.chunk_text}'"
        except ValueError:
            chunk_text_repr = "<unavailable: document_text not set>"

        return (
            "TextChunk(\n"
            f"  interval=[{interval_repr}],\n"
            f"  Chunk Text: {chunk_text_repr}\n"
            ")"
        )

    @property
    def get_tokenized_text(self) -> tokenizer_lib.TokenizedText | None:
        """Gets the tokenized text from the source document."""
        return self.tokenized_text

    @property
    def chunk_text(self) -> str:
        """Gets the chunk text. Raises an error if `document_text` is not set."""
        if self.document_text is None:
            raise ValueError("document_text must be set to access chunk_text.")
        if self._chunk_text is None:
            self._chunk_text = get_token_interval_text(self.tokenized_text, self.token_interval)
        return self._chunk_text

    @property
    def sanitized_chunk_text(self) -> str:
        """Gets the sanitized chunk text."""
        if self._sanitized_chunk_text is None:
            self._sanitized_chunk_text = _sanitize(self.chunk_text)
        return self._sanitized_chunk_text

    @property
    def char_interval(self) -> CharInterval:
        """Gets the character interval corresponding to the token interval.

        Returns:
          data.CharInterval: The character interval for this chunk.

        Raises:
          ValueError: If document_text is not set.
        """
        if self._char_interval is None:
            if self.document_text is None:
                raise ValueError("document_text must be set to compute char_interval.")
            self._char_interval = get_char_interval(self.tokenized_text, self.token_interval)
        return self._char_interval

char_interval property

Gets the character interval corresponding to the token interval.

Returns:

| Type | Description |
| --- | --- |
| `CharInterval` | The character interval for this chunk. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `document_text` is not set. |

chunk_text property

Gets the chunk text. Raises an error if document_text is not set.

get_tokenized_text property

Gets the tokenized text from the source document.

sanitized_chunk_text property

Gets the sanitized chunk text.
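
The private `_sanitize` helper is not shown in this module's listing; based on the description above (all whitespace converted to a single space), it likely amounts to something like this hypothetical sketch:

```python
import re

def _sanitize(text: str) -> str:
    # Hypothetical sketch: collapse every run of whitespace (tabs, newlines,
    # repeated spaces) to a single space. The real implementation may differ.
    return re.sub(r"\s+", " ", text)

assert _sanitize("Roses\tare\n\nred.") == "Roses are red."
```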

TokenUtilError

Bases: BaseException

Error raised when token_util returns unexpected values.

Source code in src/kibad_llm/extractors/chunking_utils/core.py
class TokenUtilError(BaseException):
    """Error raised when token_util returns unexpected values."""

create_token_interval(start_index, end_index)

Creates a token interval.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `start_index` | `int` | First token's index (inclusive). | required |
| `end_index` | `int` | Last token's index + 1 (exclusive). | required |

Returns:

| Type | Description |
| --- | --- |
| `TokenInterval` | Token interval. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the token indices are invalid. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def create_token_interval(start_index: int, end_index: int) -> tokenizer_lib.TokenInterval:
    """Creates a token interval.

    Args:
      start_index: first token's index (inclusive).
      end_index: last token's index + 1 (exclusive).

    Returns:
      Token interval.

    Raises:
      ValueError: If the token indices are invalid.
    """
    if start_index < 0:
        raise ValueError(f"Start index {start_index} must be positive.")
    if start_index >= end_index:
        raise ValueError(f"Start index {start_index} must be < end index {end_index}.")
    return tokenizer_lib.TokenInterval(start_index=start_index, end_index=end_index)
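
Intervals are half-open, so end_index must be strictly greater than start_index. A sketch (import path assumed as in the earlier examples):

```python
from kibad_llm.extractors.chunking_utils import core  # assumed path

interval = core.create_token_interval(0, 2)  # covers tokens 0 and 1
try:
    core.create_token_interval(2, 2)  # empty interval: start must be < end
except ValueError as err:
    print(err)  # "Start index 2 must be < end index 2."
```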

get_char_interval(tokenized_text, token_interval)

Returns the char interval corresponding to the token interval.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenized_text` | `TokenizedText` | Document. | required |
| `token_interval` | `TokenInterval` | Token interval. | required |

Returns:

| Type | Description |
| --- | --- |
| `CharInterval` | Char interval of the token interval of interest. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the token_interval is invalid. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def get_char_interval(
    tokenized_text: tokenizer_lib.TokenizedText,
    token_interval: tokenizer_lib.TokenInterval,
) -> CharInterval:
    """Returns the char interval corresponding to the token interval.

    Args:
      tokenized_text: Document.
      token_interval: Token interval.

    Returns:
      Char interval of the token interval of interest.

    Raises:
      ValueError: If the token_interval is invalid.
    """
    if token_interval.start_index >= token_interval.end_index:
        raise ValueError(
            f"Start index {token_interval.start_index} must be < end index "
            f"{token_interval.end_index}."
        )
    start_token = tokenized_text.tokens[token_interval.start_index]
    # Last token in the interval (end_index is exclusive).
    final_token = tokenized_text.tokens[token_interval.end_index - 1]
    return CharInterval(
        start_pos=start_token.char_interval.start_pos,
        end_pos=final_token.char_interval.end_pos,
    )
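
A sketch mapping a token interval back to character positions (same tokenizer assumptions as the earlier examples); the resulting CharInterval slices the original text:

```python
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

tokenized = tokenizer_lib.Tokenizer().tokenize("Roses are red.")
interval = core.create_token_interval(0, 2)  # first two tokens by position
span = core.get_char_interval(tokenized, interval)
print(tokenized.text[span.start_pos : span.end_pos])  # e.g. "Roses are"
```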

get_token_interval_text(tokenized_text, token_interval)

Get the text within an interval of tokens.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenized_text` | `TokenizedText` | Tokenized document. | required |
| `token_interval` | `TokenInterval` | An interval specifying the start (inclusive) and end (exclusive) indices of the tokens to extract. These indices refer to positions in the list `tokenized_text.tokens`, not the value of the `index` field of `token_pb2.Token`. If the tokens are [(index:0, text:A), (index:5, text:B), (index:10, text:C)], use token_interval=[0, 2] to take A and B, not [0, 6]. See the implementation of `tokenizer_lib.tokens_text` for details. | required |

Returns:

| Type | Description |
| --- | --- |
| `str` | Text within the token interval. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the token indices are invalid. |
| `TokenUtilError` | If `tokenizer_lib.tokens_text` returns an empty string. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def get_token_interval_text(
    tokenized_text: tokenizer_lib.TokenizedText,
    token_interval: tokenizer_lib.TokenInterval,
) -> str:
    """Get the text within an interval of tokens.

    Args:
      tokenized_text: Tokenized document.
      token_interval: An interval specifying the start (inclusive) and end
        (exclusive) indices of the tokens to extract. These indices refer to the
        positions in the list of tokens within `tokenized_text.tokens`, not the
        value of the field `index` of `token_pb2.Token`. If the tokens are
        [(index:0, text:A), (index:5, text:B), (index:10, text:C)], use
        token_interval=[0, 2] to take A and B, not [0, 6]. See the
        implementation of tokenizer_lib.tokens_text for details.

    Returns:
      Text within the token interval.

    Raises:
      ValueError: If the token indices are invalid.
      TokenUtilError: If tokenizer_lib.tokens_text returns an empty
      string.
    """
    if token_interval.start_index >= token_interval.end_index:
        raise ValueError(
            f"Start index {token_interval.start_index} must be < end index "
            f"{token_interval.end_index}."
        )
    return_string = tokenizer_lib.tokens_text(tokenized_text, token_interval)
    logging.debug(
        "Token util returns string: %s for tokenized_text: %s, token_interval:" " %s",
        return_string,
        tokenized_text,
        token_interval,
    )
    if tokenized_text.text and not return_string:
        raise TokenUtilError(
            "Token util returns an empty string unexpectedly. Number of tokens is"
            f" tokenized_text: {len(tokenized_text.tokens)}, token_interval is"
            f" {token_interval.start_index} to {token_interval.end_index}, which"
            " should not lead to empty string."
        )
    return return_string
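
To underline the docstring's point about indices: the interval counts positions in `tokenized_text.tokens`, not any `index` field stored on the tokens themselves. A sketch under the same assumptions as the earlier examples:

```python
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

tokenized = tokenizer_lib.Tokenizer().tokenize("Roses are red.")
# Positions 0 and 1 in tokenized.tokens, regardless of each token's own
# internal `index` value.
interval = core.create_token_interval(0, 2)
print(core.get_token_interval_text(tokenized, interval))  # e.g. "Roses are"
```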