Core

Library for breaking documents into chunks of sentences.

When a text-to-text model (e.g. a large language model with a fixed context size) cannot accommodate a large document, this library helps break the document into chunks of a required maximum length that we can run inference on.
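
For orientation, here is a minimal usage sketch. The import path is inferred from the source location shown below, and instantiating `tokenizer_lib.Tokenizer()` directly is an assumption, not confirmed API:

```python
# Hypothetical usage sketch. The import path is inferred from
# src/kibad_llm/extractors/chunking_utils/core.py, and instantiating
# tokenizer_lib.Tokenizer() directly is an assumption.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

document = "Roses are red. Violets are blue. Flowers are nice. And so are you."
chunker = core.ChunkIterator(
    document=document,
    max_char_buffer=60,  # largest chunk, in characters, we can run inference on
    tokenizer_impl=tokenizer_lib.Tokenizer(),
)
for chunk in chunker:
    print(chunk.chunk_text)
```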

CharInterval dataclass

Class for representing a character interval.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `start_pos` | `int \| None` | The starting position of the interval (inclusive). |
| `end_pos` | `int \| None` | The ending position of the interval (exclusive). |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
@dataclasses.dataclass
class CharInterval:
    """Class for representing a character interval.

    Attributes:
      start_pos: The starting position of the interval (inclusive).
      end_pos: The ending position of the interval (exclusive).
    """

    start_pos: int | None = None
    end_pos: int | None = None
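
A CharInterval addresses a half-open [start_pos, end_pos) span of the original document string, so it can be used directly to slice the text:

```python
# Illustrative only: slice a document with a half-open character interval.
interval = CharInterval(start_pos=0, end_pos=5)
text = "Hello, world."
print(text[interval.start_pos : interval.end_pos])  # -> "Hello"
```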

ChunkIterator

Iterate through chunks of a tokenized text.

Chunks may consist of sentences or sentence fragments that can fit into the maximum character buffer that we can run inference on.

Chunk cases:

A) If a sentence length exceeds the max char buffer, then it needs to be broken into chunks that can fit within the max char buffer. We do this in a way that maximizes the chunk length while respecting newlines (if present) and token boundaries. Consider this sentence from a poem by John Donne:

No man is an island,
Entire of itself,
Every man is a piece of the continent,
A part of the main.

With max_char_buffer=40, the chunks are:

* "No man is an island,\nEntire of itself," len=38
* "Every man is a piece of the continent," len=38
* "A part of the main." len=19

B) If a single token exceeds the max char buffer, it comprises the whole chunk. Consider the sentence: "This is antidisestablishmentarianism." With max_char_buffer=20, the chunks are:

* "This is" len=7
* "antidisestablishmentarianism" len=28
* "." len=1

C) If multiple whole sentences can fit within the max char buffer, then they are used to form the chunk. Consider the sentences: "Roses are red. Violets are blue. Flowers are nice. And so are you." With max_char_buffer=60, the chunks are:

* "Roses are red. Violets are blue. Flowers are nice." len=50
* "And so are you." len=15
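
The cases above can be reproduced with a short loop. This sketch reuses the hypothetical import and tokenizer assumptions from the introduction; the expected output is taken from case A:

```python
# Sketch of case A; imports and Tokenizer() instantiation are the same
# assumptions as in the introductory example.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

poem = (
    "No man is an island,\n"
    "Entire of itself,\n"
    "Every man is a piece of the continent,\n"
    "A part of the main."
)
chunker = core.ChunkIterator(poem, max_char_buffer=40,
                             tokenizer_impl=tokenizer_lib.Tokenizer())
for chunk in chunker:
    print(repr(chunk.chunk_text), len(chunk.chunk_text))
# Expected, per case A above:
# 'No man is an island,\nEntire of itself,' 38
# 'Every man is a piece of the continent,' 38
# 'A part of the main.' 19
```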

Source code in src/kibad_llm/extractors/chunking_utils/core.py
class ChunkIterator:
    r"""Iterate through chunks of a tokenized text.

    Chunks may consist of sentences or sentence fragments that can fit into the
    maximum character buffer that we can run inference on.

    Chunk cases:

    A)
    If a sentence length exceeds the max char buffer, then it needs to be broken
    into chunks that can fit within the max char buffer. We do this in a way that
    maximizes the chunk length while respecting newlines (if present) and token
    boundaries.
    Consider this sentence from a poem by John Donne:
    ```
    No man is an island,
    Entire of itself,
    Every man is a piece of the continent,
    A part of the main.
    ```
    With max_char_buffer=40, the chunks are:
    * "No man is an island,\nEntire of itself," len=38
    * "Every man is a piece of the continent," len=38
    * "A part of the main." len=19

    B)
    If a single token exceeds the max char buffer, it comprises the whole chunk.
    Consider the sentence:
    "This is antidisestablishmentarianism."
    With max_char_buffer=20, the chunks are:
    * "This is" len=7
    * "antidisestablishmentarianism" len=28
    * "." len(1)

    C)
    If multiple *whole* sentences can fit within the max char buffer, then they
    are used to form the chunk.
    Consider the sentences:
    "Roses are red. Violets are blue. Flowers are nice. And so are you."
    With max_char_buffer=60, the chunks are:
    * "Roses are red. Violets are blue. Flowers are nice." len=50
    * "And so are you." len=15
    """

    def __init__(
        self,
        document: str,
        max_char_buffer: int,
        tokenizer_impl: tokenizer_lib.Tokenizer,
    ):
        """Constructor.

        Args:
            document: Document to chunk, as a string.
            max_char_buffer: Size of buffer that we can run inference on.
            tokenizer_impl: Tokenizer instance to use.
        """

        if isinstance(document, str):
            tokenized_text = tokenizer_impl.tokenize(document)
        else:
            raise ValueError("document has the wrong format. str expected")
        self.tokenized_text = tokenized_text
        self.max_char_buffer = max_char_buffer
        self.sentence_iter = SentenceIterator(self.tokenized_text)
        self.broken_sentence = False
        self.document = document

    def __iter__(self) -> Iterator[TextChunk]:
        return self

    def _tokens_exceed_buffer(self, token_interval: tokenizer_lib.TokenInterval) -> bool:
        """Check if the token interval exceeds the maximum buffer size.

        Args:
          token_interval: Token interval to check.

        Returns:
          True if the token interval exceeds the maximum buffer size.
        """
        char_interval = get_char_interval(self.tokenized_text, token_interval)
        if char_interval.start_pos is None or char_interval.end_pos is None:
            return False
        return (char_interval.end_pos - char_interval.start_pos) > self.max_char_buffer

    def __next__(self) -> TextChunk:
        sentence = next(self.sentence_iter)
        # If the next token is greater than the max_char_buffer, let it be the
        # entire chunk.
        curr_chunk = create_token_interval(sentence.start_index, sentence.start_index + 1)
        if self._tokens_exceed_buffer(curr_chunk):
            self.sentence_iter = SentenceIterator(
                self.tokenized_text, curr_token_pos=sentence.start_index + 1
            )
            self.broken_sentence = curr_chunk.end_index < sentence.end_index
            return TextChunk(
                token_interval=curr_chunk,
                document_text=self.document,
                tokenized_text=self.tokenized_text,
            )

        # Append tokens to the chunk up to the max_char_buffer.
        start_of_new_line = -1
        for token_index in range(curr_chunk.start_index, sentence.end_index):
            if self.tokenized_text.tokens[token_index].first_token_after_newline:
                start_of_new_line = token_index
            test_chunk = create_token_interval(curr_chunk.start_index, token_index + 1)
            if self._tokens_exceed_buffer(test_chunk):
                # Only break at a newline if one was seen and it falls strictly
                # after the chunk start, which keeps the interval non-empty.
                if start_of_new_line > 0 and start_of_new_line > curr_chunk.start_index:
                    # Terminate the curr_chunk at the start of the most recent newline.
                    curr_chunk = create_token_interval(curr_chunk.start_index, start_of_new_line)
                self.sentence_iter = SentenceIterator(
                    self.tokenized_text, curr_token_pos=curr_chunk.end_index
                )
                self.broken_sentence = True
                return TextChunk(
                    token_interval=curr_chunk,
                    document_text=self.document,
                    tokenized_text=self.tokenized_text,
                )
            else:
                curr_chunk = test_chunk

        if self.broken_sentence:
            self.broken_sentence = False
        else:
            for sentence in self.sentence_iter:
                test_chunk = create_token_interval(curr_chunk.start_index, sentence.end_index)
                if self._tokens_exceed_buffer(test_chunk):
                    self.sentence_iter = SentenceIterator(
                        self.tokenized_text,
                        curr_token_pos=curr_chunk.end_index,
                    )
                    return TextChunk(
                        token_interval=curr_chunk,
                        document_text=self.document,
                        tokenized_text=self.tokenized_text,
                    )
                else:
                    curr_chunk = test_chunk
        return TextChunk(
            token_interval=curr_chunk,
            document_text=self.document,
            tokenized_text=self.tokenized_text,
        )

__init__(document, max_char_buffer, tokenizer_impl)

Constructor.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `document` | `str` | Document to chunk, as a string. | required |
| `max_char_buffer` | `int` | Size of buffer that we can run inference on. | required |
| `tokenizer_impl` | `Tokenizer` | Tokenizer instance to use. | required |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def __init__(
    self,
    document: str,
    max_char_buffer: int,
    tokenizer_impl: tokenizer_lib.Tokenizer,
):
    """Constructor.

    Args:
        document: Document to chunk, as a string.
        max_char_buffer: Size of buffer that we can run inference on.
        tokenizer_impl: Tokenizer instance to use.
    """

    if isinstance(document, str):
        tokenized_text = tokenizer_impl.tokenize(document)
    else:
        raise ValueError("document has the wrong format. str expected")
    self.tokenized_text = tokenized_text
    self.max_char_buffer = max_char_buffer
    self.sentence_iter = SentenceIterator(self.tokenized_text)
    self.broken_sentence = False
    self.document = document

SentenceIterator

Iterate through sentences of a tokenized text.

Source code in src/kibad_llm/extractors/chunking_utils/core.py
class SentenceIterator:
    """Iterate through sentences of a tokenized text."""

    def __init__(
        self,
        tokenized_text: tokenizer_lib.TokenizedText,
        curr_token_pos: int = 0,
    ):
        """Constructor.

        Args:
          tokenized_text: Document to iterate through.
          curr_token_pos: Iterate through sentences from this token position.

        Raises:
          IndexError: if curr_token_pos is not within the document.
        """
        self.tokenized_text = tokenized_text
        self.token_len = len(tokenized_text.tokens)
        if curr_token_pos < 0:
            raise IndexError(f"Current token position {curr_token_pos} can not be negative.")
        elif curr_token_pos > self.token_len:
            raise IndexError(
                f"Current token position {curr_token_pos} is past the length of the "
                f"document {self.token_len}."
            )
        self.curr_token_pos = curr_token_pos

    def __iter__(self) -> Iterator[tokenizer_lib.TokenInterval]:
        return self

    def __next__(self) -> tokenizer_lib.TokenInterval:
        """Returns next sentence's interval starting from current token position.

        Returns:
          Next sentence token interval starting from current token position.

        Raises:
          StopIteration: If end of text is reached.
        """
        assert self.curr_token_pos <= self.token_len
        if self.curr_token_pos == self.token_len:
            raise StopIteration
        # This locates the sentence which contains the current token position.
        sentence_range = tokenizer_lib.find_sentence_range(
            self.tokenized_text.text,
            self.tokenized_text.tokens,
            self.curr_token_pos,
        )
        assert sentence_range
        # Start the sentence from the current token position.
        # If we are in the middle of a sentence, we should start from there.
        sentence_range = create_token_interval(self.curr_token_pos, sentence_range.end_index)
        self.curr_token_pos = sentence_range.end_index
        return sentence_range
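
A sketch of iterating sentence intervals directly (same import and tokenizer assumptions as the earlier examples); `get_token_interval_text` converts each interval back to text:

```python
# Sketch: walk the sentences of a tokenized text one interval at a time.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

tokenized = tokenizer_lib.Tokenizer().tokenize("Roses are red. Violets are blue.")
for interval in core.SentenceIterator(tokenized):
    # Each interval is a half-open [start_index, end_index) token span.
    print(core.get_token_interval_text(tokenized, interval))
# Exact boundaries depend on the tokenizer's sentence segmentation.
```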

__init__(tokenized_text, curr_token_pos=0)

Constructor.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenized_text` | `TokenizedText` | Document to iterate through. | required |
| `curr_token_pos` | `int` | Iterate through sentences from this token position. | `0` |

Raises:

| Type | Description |
| --- | --- |
| `IndexError` | If `curr_token_pos` is not within the document. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def __init__(
    self,
    tokenized_text: tokenizer_lib.TokenizedText,
    curr_token_pos: int = 0,
):
    """Constructor.

    Args:
      tokenized_text: Document to iterate through.
      curr_token_pos: Iterate through sentences from this token position.

    Raises:
      IndexError: if curr_token_pos is not within the document.
    """
    self.tokenized_text = tokenized_text
    self.token_len = len(tokenized_text.tokens)
    if curr_token_pos < 0:
        raise IndexError(f"Current token position {curr_token_pos} can not be negative.")
    elif curr_token_pos > self.token_len:
        raise IndexError(
            f"Current token position {curr_token_pos} is past the length of the "
            f"document {self.token_len}."
        )
    self.curr_token_pos = curr_token_pos

__next__()

Returns next sentence's interval starting from current token position.

Returns:

| Type | Description |
| --- | --- |
| `TokenInterval` | Next sentence token interval starting from current token position. |

Raises:

| Type | Description |
| --- | --- |
| `StopIteration` | If the end of the text is reached. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def __next__(self) -> tokenizer_lib.TokenInterval:
    """Returns next sentence's interval starting from current token position.

    Returns:
      Next sentence token interval starting from current token position.

    Raises:
      StopIteration: If end of text is reached.
    """
    assert self.curr_token_pos <= self.token_len
    if self.curr_token_pos == self.token_len:
        raise StopIteration
    # This locates the sentence which contains the current token position.
    sentence_range = tokenizer_lib.find_sentence_range(
        self.tokenized_text.text,
        self.tokenized_text.tokens,
        self.curr_token_pos,
    )
    assert sentence_range
    # Start the sentence from the current token position.
    # If we are in the middle of a sentence, we should start from there.
    sentence_range = create_token_interval(self.curr_token_pos, sentence_range.end_index)
    self.curr_token_pos = sentence_range.end_index
    return sentence_range

TextChunk dataclass

Stores a text chunk with attributes to the source document.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `token_interval` | `TokenInterval` | The token interval of the chunk in the source document. |
| `tokenized_text` | `TokenizedText` | The source document in its tokenized form as a TokenizedText object. |
| `document_text` | `str \| None` | The source document text. |

Properties:

* `get_tokenized_text`: TokenizedText of the current document.
* `chunk_text`: Text of the current chunk as a string.
* `sanitized_chunk_text`: Text of the current chunk as a sanitized string (all whitespace collapsed to a single space).
* `char_interval`: CharInterval of the chunk.
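
A sketch of reading these properties off a chunk produced by ChunkIterator (same assumptions as the earlier examples); note that `chunk_text` and `char_interval` require `document_text` to be set, which ChunkIterator does:

```python
# Sketch: inspect the first chunk of a document.
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

document = "Roses are red.\n\tViolets are blue."
chunk = next(core.ChunkIterator(document, max_char_buffer=60,
                                tokenizer_impl=tokenizer_lib.Tokenizer()))
print(chunk.chunk_text)            # raw chunk text
print(chunk.sanitized_chunk_text)  # whitespace collapsed to single spaces
span = chunk.char_interval         # half-open [start_pos, end_pos) into document
print(document[span.start_pos : span.end_pos])
```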

Source code in src/kibad_llm/extractors/chunking_utils/core.py
@dataclasses.dataclass
class TextChunk:
    """Stores a text chunk with attributes to the source document.

    Attributes:
      token_interval: The token interval of the chunk in the source document.
      tokenized_text: The source document in its tokenized form as a
        TokenizedText object.
      document_text: The source document text.

    Properties:
      get_tokenized_text: TokenizedText of the current document.
      chunk_text: Text of the current chunk as a string.
      sanitized_chunk_text: Text of the current chunk as a sanitized string
        (all whitespace collapsed to a single space by `_sanitize`).
      char_interval: CharInterval of the chunk.

    """

    token_interval: tokenizer_lib.TokenInterval
    tokenized_text: tokenizer_lib.TokenizedText
    document_text: str | None = None
    _chunk_text: str | None = dataclasses.field(default=None, init=False, repr=False)
    _sanitized_chunk_text: str | None = dataclasses.field(default=None, init=False, repr=False)
    _char_interval: CharInterval | None = dataclasses.field(default=None, init=False, repr=False)

    def __str__(self):
        interval_repr = (
            f"start_index: {self.token_interval.start_index}, end_index:"
            f" {self.token_interval.end_index}"
        )

        try:
            chunk_text_repr = f"'{self.chunk_text}'"
        except ValueError:
            chunk_text_repr = "<unavailable: document_text not set>"

        return (
            "TextChunk(\n"
            f"  interval=[{interval_repr}],\n"
            f"  Chunk Text: {chunk_text_repr}\n"
            ")"
        )

    @property
    def get_tokenized_text(self) -> tokenizer_lib.TokenizedText | None:
        """Gets the tokenized text from the source document."""
        return self.tokenized_text

    @property
    def chunk_text(self) -> str:
        """Gets the chunk text. Raises an error if `document_text` is not set."""
        if self.document_text is None:
            raise ValueError("document_text must be set to access chunk_text.")
        if self._chunk_text is None:
            self._chunk_text = get_token_interval_text(self.tokenized_text, self.token_interval)
        return self._chunk_text

    @property
    def sanitized_chunk_text(self) -> str:
        """Gets the sanitized chunk text."""
        if self._sanitized_chunk_text is None:
            self._sanitized_chunk_text = _sanitize(self.chunk_text)
        return self._sanitized_chunk_text

    @property
    def char_interval(self) -> CharInterval:
        """Gets the character interval corresponding to the token interval.

        Returns:
          data.CharInterval: The character interval for this chunk.

        Raises:
          ValueError: If document_text is not set.
        """
        if self._char_interval is None:
            if self.document_text is None:
                raise ValueError("document_text must be set to compute char_interval.")
            self._char_interval = get_char_interval(self.tokenized_text, self.token_interval)
        return self._char_interval

char_interval property

Gets the character interval corresponding to the token interval.

Returns:

| Type | Description |
| --- | --- |
| `CharInterval` | The character interval for this chunk. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If `document_text` is not set. |

chunk_text property

Gets the chunk text. Raises an error if document_text is not set.

get_tokenized_text property

Gets the tokenized text from the source document.

sanitized_chunk_text property

Gets the sanitized chunk text.
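
The private `_sanitize` helper is not shown in this module's listing; based on the description above (all whitespace converted to a single space), it likely amounts to something like this hypothetical sketch:

```python
import re

def _sanitize(text: str) -> str:
    # Hypothetical sketch: collapse every run of whitespace (tabs, newlines,
    # repeated spaces) to a single space. The real implementation may differ.
    return re.sub(r"\s+", " ", text)

assert _sanitize("Roses\tare\n\nred.") == "Roses are red."
```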

TokenUtilError

Bases: BaseException

Error raised when token_util returns unexpected values.

Source code in src/kibad_llm/extractors/chunking_utils/core.py
class TokenUtilError(BaseException):
    """Error raised when token_util returns unexpected values."""

create_token_interval(start_index, end_index)

Creates a token interval.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `start_index` | `int` | First token's index (inclusive). | required |
| `end_index` | `int` | Last token's index + 1 (exclusive). | required |

Returns:

| Type | Description |
| --- | --- |
| `TokenInterval` | Token interval. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the token indices are invalid. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def create_token_interval(start_index: int, end_index: int) -> tokenizer_lib.TokenInterval:
    """Creates a token interval.

    Args:
      start_index: first token's index (inclusive).
      end_index: last token's index + 1 (exclusive).

    Returns:
      Token interval.

    Raises:
      ValueError: If the token indices are invalid.
    """
    if start_index < 0:
        raise ValueError(f"Start index {start_index} must be positive.")
    if start_index >= end_index:
        raise ValueError(f"Start index {start_index} must be < end index {end_index}.")
    return tokenizer_lib.TokenInterval(start_index=start_index, end_index=end_index)
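
Intervals are half-open, so end_index must be strictly greater than start_index. A sketch (import path assumed as in the earlier examples):

```python
from kibad_llm.extractors.chunking_utils import core  # assumed path

interval = core.create_token_interval(0, 2)  # covers tokens 0 and 1
try:
    core.create_token_interval(2, 2)  # empty interval: start must be < end
except ValueError as err:
    print(err)  # "Start index 2 must be < end index 2."
```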

get_char_interval(tokenized_text, token_interval)

Returns the char interval corresponding to the token interval.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenized_text` | `TokenizedText` | Document. | required |
| `token_interval` | `TokenInterval` | Token interval. | required |

Returns:

| Type | Description |
| --- | --- |
| `CharInterval` | Char interval of the token interval of interest. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the token_interval is invalid. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def get_char_interval(
    tokenized_text: tokenizer_lib.TokenizedText,
    token_interval: tokenizer_lib.TokenInterval,
) -> CharInterval:
    """Returns the char interval corresponding to the token interval.

    Args:
      tokenized_text: Document.
      token_interval: Token interval.

    Returns:
      Char interval of the token interval of interest.

    Raises:
      ValueError: If the token_interval is invalid.
    """
    if token_interval.start_index >= token_interval.end_index:
        raise ValueError(
            f"Start index {token_interval.start_index} must be < end index "
            f"{token_interval.end_index}."
        )
    start_token = tokenized_text.tokens[token_interval.start_index]
    # Last token in the interval (end_index is exclusive).
    final_token = tokenized_text.tokens[token_interval.end_index - 1]
    return CharInterval(
        start_pos=start_token.char_interval.start_pos,
        end_pos=final_token.char_interval.end_pos,
    )
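
A sketch mapping a token interval back to character positions (same tokenizer assumptions as the earlier examples); the resulting CharInterval slices the original text:

```python
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

tokenized = tokenizer_lib.Tokenizer().tokenize("Roses are red.")
interval = core.create_token_interval(0, 2)  # first two tokens by position
span = core.get_char_interval(tokenized, interval)
print(tokenized.text[span.start_pos : span.end_pos])  # e.g. "Roses are"
```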

get_token_interval_text(tokenized_text, token_interval)

Get the text within an interval of tokens.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `tokenized_text` | `TokenizedText` | Tokenized document. | required |
| `token_interval` | `TokenInterval` | An interval specifying the start (inclusive) and end (exclusive) indices of the tokens to extract. These indices refer to positions in the list `tokenized_text.tokens`, not the value of the `index` field of `token_pb2.Token`. If the tokens are [(index:0, text:A), (index:5, text:B), (index:10, text:C)], use token_interval=[0, 2] to take A and B, not [0, 6]. See the implementation of `tokenizer_lib.tokens_text` for details. | required |

Returns:

| Type | Description |
| --- | --- |
| `str` | Text within the token interval. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If the token indices are invalid. |
| `TokenUtilError` | If `tokenizer_lib.tokens_text` returns an empty string. |

Source code in src/kibad_llm/extractors/chunking_utils/core.py
def get_token_interval_text(
    tokenized_text: tokenizer_lib.TokenizedText,
    token_interval: tokenizer_lib.TokenInterval,
) -> str:
    """Get the text within an interval of tokens.

    Args:
      tokenized_text: Tokenized document.
      token_interval: An interval specifying the start (inclusive) and end
        (exclusive) indices of the tokens to extract. These indices refer to the
        positions in the list of tokens within `tokenized_text.tokens`, not the
        value of the field `index` of `token_pb2.Token`. If the tokens are
        [(index:0, text:A), (index:5, text:B), (index:10, text:C)], use
        token_interval=[0, 2] to take A and B, not [0, 6]. See the
        implementation of tokenizer_lib.tokens_text for details.

    Returns:
      Text within the token interval.

    Raises:
      ValueError: If the token indices are invalid.
      TokenUtilError: If tokenizer_lib.tokens_text returns an empty
      string.
    """
    if token_interval.start_index >= token_interval.end_index:
        raise ValueError(
            f"Start index {token_interval.start_index} must be < end index "
            f"{token_interval.end_index}."
        )
    return_string = tokenizer_lib.tokens_text(tokenized_text, token_interval)
    logging.debug(
        "Token util returns string: %s for tokenized_text: %s, token_interval:" " %s",
        return_string,
        tokenized_text,
        token_interval,
    )
    if tokenized_text.text and not return_string:
        raise TokenUtilError(
            "Token util returns an empty string unexpectedly. Number of tokens is"
            f" tokenized_text: {len(tokenized_text.tokens)}, token_interval is"
            f" {token_interval.start_index} to {token_interval.end_index}, which"
            " should not lead to empty string."
        )
    return return_string
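
To underline the docstring's point about indices: the interval counts positions in `tokenized_text.tokens`, not any `index` field stored on the tokens themselves. A sketch under the same assumptions as the earlier examples:

```python
from kibad_llm.extractors.chunking_utils import core, tokenizer_lib

tokenized = tokenizer_lib.Tokenizer().tokenize("Roses are red.")
# Positions 0 and 1 in tokenized.tokens, regardless of each token's own
# internal `index` value.
interval = core.create_token_interval(0, 2)
print(core.get_token_interval_text(tokenized, interval))  # e.g. "Roses are"
```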