Base
add_reasoning_content_callback(out, response, *, llm)
Add reasoning_content to output dictionary.
Source code in src/kibad_llm/extractors/base.py
add_response_content_callback(out, response, *, llm)
Add response_content to output dictionary.
Source code in src/kibad_llm/extractors/base.py
add_structured_callback(out, response, *, schema, validate_with_schema)
Add structured output to output dictionary based on response content.
Source code in src/kibad_llm/extractors/base.py
augment_and_strip_metadata_from_structured_callback(out, response, *, schema, original_schema, text, validate_with_schema, augment_metadata_kwargs=None)
Augment metadata in structured output and save it as structured_with_metadata.
Then, strip metadata and save the cleaned version back to structured.
Source code in src/kibad_llm/extractors/base.py
augment_metadata(data, *, text, content_key, **kwargs)
Recursively augment all metadata wrapper dicts in a JSON-parsed result with evidence info.
Traversal:

- walks `data` through nested dicts/lists
- detects wrapper dicts via `_is_wrapper_dict(..., content_key=...)`
- for each wrapper dict, calls `augment_metadata_node_with_evidence(...)`

Other Parameters:

| Name | Type | Description |
|---|---|---|
| `**kwargs` | | kwargs are namespaced by prefix; see the source for the currently supported prefixes. |

The returned structure mirrors the input but includes added evidence fields where applicable.
Source code in src/kibad_llm/extractors/base.py
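The traversal described above can be sketched as follows. This is a simplified stand-in, not the real implementation: it detects wrapper dicts with the documented heuristic and only counts anchor matches, where the real function delegates to `augment_metadata_node_with_evidence` for full evidence info.

```python
from typing import Any


def is_wrapper_dict(obj: Any, *, content_key: str = "content") -> bool:
    # Heuristic mirroring the documented detection: a dict with the content
    # key plus at least one other key counts as a metadata wrapper.
    return isinstance(obj, dict) and content_key in obj and len(obj) > 1


def augment_metadata_sketch(data: Any, *, text: str, content_key: str = "content") -> Any:
    """Recursively visit dicts/lists and annotate wrapper dicts with evidence info."""
    if is_wrapper_dict(data, content_key=content_key):
        node = dict(data)  # do not mutate the input
        anchor = node.get("evidence_anchor")
        if isinstance(anchor, str) and anchor:
            node["evidence_num_matches"] = text.count(anchor)
        return node
    if isinstance(data, dict):
        return {k: augment_metadata_sketch(v, text=text, content_key=content_key)
                for k, v in data.items()}
    if isinstance(data, list):
        return [augment_metadata_sketch(v, text=text, content_key=content_key) for v in data]
    return data
```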
augment_metadata_node_with_evidence(node, text, token_spans, *, anchor_key='evidence_anchor', num_matches_key='evidence_num_matches', start_key='first_evidence_start', end_key='first_evidence_end', snippet_key='first_evidence_snippet', snippet_margin=10)
Augment a single metadata wrapper dict with evidence location information.
Given a wrapper object like `{"content": ..., "evidence_anchor": "...", ...}`, this function searches `text` for the anchor (via `_find_anchor_match_spans`). If at least one match is found, it adds:

- `num_matches_key`: number of matches
- `start_key` / `end_key`: character offsets of the first match
- `snippet_key`: a substring of `text` spanning `snippet_margin` tokens around the match (whitespace preserved)

If no anchor is present (or no matches exist), the wrapper is returned unchanged except for `num_matches_key` (only added when the anchor is a non-empty string).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `node` | `Mapping[str, Any]` | The metadata wrapper dict to augment. | required |
| `text` | `str` | The original text to search for evidence anchors. | required |
| `token_spans` | `list[tuple[int, int]]` | Precomputed list of (start_offset, end_offset) tuples for each token in `text`. | required |
| `anchor_key` | `str` | The key in wrapper dicts that holds the evidence anchor text. | `'evidence_anchor'` |
| `num_matches_key` | `str` | The key to add for the number of matches of the anchor in the text. | `'evidence_num_matches'` |
| `start_key` | `str` | The key to add for the start character offset of the anchor. | `'first_evidence_start'` |
| `end_key` | `str` | The key to add for the end character offset of the anchor. | `'first_evidence_end'` |
| `snippet_key` | `str` | The key to add for the evidence snippet text. | `'first_evidence_snippet'` |
| `snippet_margin` | `int` | Number of tokens to include before and after the anchor span in the snippet. | `10` |
Returns: The augmented metadata wrapper dict with evidence metadata added where applicable.
Source code in src/kibad_llm/extractors/base.py
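A minimal sketch of the documented behavior, assuming whitespace tokenization for `token_spans` and literal substring matching for the anchor (the real `_find_anchor_match_spans` may match differently). Slicing the original text by token offsets is what keeps the snippet's whitespace preserved:

```python
import re


def token_spans_of(text: str) -> list[tuple[int, int]]:
    """Precompute (start, end) character offsets for each whitespace-delimited token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def augment_node_sketch(node: dict, text: str, spans: list[tuple[int, int]],
                        *, snippet_margin: int = 10) -> dict:
    out = dict(node)  # the input is not mutated
    anchor = out.get("evidence_anchor")
    if not (isinstance(anchor, str) and anchor):
        return out  # no anchor: return unchanged
    matches = [m.span() for m in re.finditer(re.escape(anchor), text)]
    out["evidence_num_matches"] = len(matches)
    if not matches:
        return out
    start, end = matches[0]
    out["first_evidence_start"], out["first_evidence_end"] = start, end
    # Find token indices overlapping the first match, then widen by the margin.
    idxs = [i for i, (s, e) in enumerate(spans) if e > start and s < end]
    lo = max(0, idxs[0] - snippet_margin)
    hi = min(len(spans) - 1, idxs[-1] + snippet_margin)
    # Slice the original text so inter-token whitespace is preserved.
    out["first_evidence_snippet"] = text[spans[lo][0]:spans[hi][1]]
    return out
```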
build_chat_message(message, role, document=None, document_placeholder='document', schema=None, schema_description_kwargs=None, schema_description_placeholder='schema_description')
Build a single chat message by inserting text and schema description if respective placeholders are present in the message template.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `message` | `str` | The message template. | required |
| `role` | `MessageRole` | The role of the message (e.g., system, user). | required |
| `document` | `str \| None` | The document text to process. | `None` |
| `document_placeholder` | `str` | The placeholder in the message template for the input text. If the placeholder is present in the message template, it will be replaced with the input text. | `'document'` |
| `schema` | `dict[str, Any] \| None` | Optional JSON schema for structured output. | `None` |
| `schema_description_kwargs` | `dict[str, Any] \| None` | Optional kwargs for `build_schema_description` when generating the schema description. | `None` |
| `schema_description_placeholder` | `str` | The placeholder in the message template for the schema description. If the placeholder is present in the message template, the schema must be provided and the description will be generated and inserted. | `'schema_description'` |

Returns:

| Type | Description |
|---|---|
| `tuple[SimpleChatMessage, dict[str, bool]]` | A tuple of the ChatMessage and a metadata dictionary indicating whether the schema description and text were inserted. |
Source code in src/kibad_llm/extractors/base.py
build_chat_messages(system_message=None, user_message=None, schema_description_placeholder='schema_description', document_placeholder='document', schema=None, history=None, return_messages=False, return_messages_formatted=False, truncate_user_message_formatted=300, _out=None, **build_messages_kwargs)
Build chat messages for extraction. The document text and schema description may be inserted into the message templates, depending on the presence of the respective placeholders.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `system_message` | `str \| None` | The system message template. | `None` |
| `user_message` | `str \| None` | The user message template. | `None` |
| `schema` | `dict[str, Any] \| None` | Optional JSON schema for structured output. | `None` |
| `schema_description_placeholder` | `str` | The placeholder in the message templates for the schema description. If the placeholder is present in the message templates, the schema must be provided and the description will be generated and inserted. | `'schema_description'` |
| `document_placeholder` | `str` | The placeholder in the message templates for the input text. If the placeholder is present in the message templates, it will be replaced with the input text. | `'document'` |
| `history` | `list[SimpleChatMessage] \| None` | Optional list of ChatMessage objects representing the conversation history. | `None` |
| `return_messages` | `bool` | Whether to return the used prompt messages, but without input text and schema description. | `False` |
| `return_messages_formatted` | `bool` | Whether to return the used prompt messages formatted with input text and schema description. | `False` |
| `truncate_user_message_formatted` | `int \| None` | If `return_messages_formatted` is True, truncate the user message content to this many characters (to avoid huge outputs). Set to None to disable truncation. | `300` |
| `_out` | `SingleExtractionResult \| None` | Optional output dictionary to store messages in (used internally). | `None` |
| `**build_messages_kwargs` | `Any` | Additional keyword arguments for `build_chat_message`. | `{}` |

Returns:

| Type | Description |
|---|---|
| `list[SimpleChatMessage]` | A list of ChatMessage objects. |
Source code in src/kibad_llm/extractors/base.py
exception2error_msg(e)
Return short and long (including traceback) error messages for an exception.
Source code in src/kibad_llm/extractors/base.py
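A short/long error-message pair like this is typically built from the exception type, its message, and the formatted traceback. A plausible stdlib-only sketch (not the actual implementation):

```python
import traceback


def exception_to_error_msgs(e: Exception) -> tuple[str, str]:
    """Return (short, long) error messages; the long one includes the traceback."""
    short = f"{type(e).__name__}: {e}"
    # format_exception renders the full traceback as a list of lines.
    long = "".join(traceback.format_exception(type(e), e, e.__traceback__))
    return short, long
```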
extract_from_text(text, text_id, prompt_template, schema=None, use_guided_decoding=True, validate_with_schema=True, llm=None, request_parameters=None, return_reasoning=False, adjust_schema_for_evidence_detection=False, adjust_schema_description_for_evidence_detection=False, evidence_anchor_description='Verbatim excerpt from the source text supporting the extracted content.', wrapped_content_description=None, response_has_metadata=False, augment_metadata_kwargs=None, user_message=None, system_message=None, schema_description_placeholder=None, document_placeholder=None, **build_messages_kwargs)
Extract structured information from text using an LLM.
Given a chat llm, composes system and user messages, and invokes the model. When a schema is provided, it is used to enforce guided decoding. The output is parsed as JSON and validated against the schema if provided.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The document text to process. | required |
| `text_id` | `str` | Document text identifier for logging. | required |
| `prompt_template` | `dict[str, str \| None]` | A dictionary with at least one of 'system_message' and 'user_message' templates (or both). | required |
| `schema` | `dict[str, Any] \| None` | Optional JSON schema for structured output. | `None` |
| `use_guided_decoding` | `bool` | Whether to use guided decoding. | `True` |
| `validate_with_schema` | `bool` | Whether to validate the output against the provided schema. IMPORTANT: Disabling validation may lead to invalid structured outputs and, thus, may break result serialization (since we use `.map()` and `.to_json()` from `datasets`). | `True` |
| `llm` | `LLM \| None` | The LLM model to use. Must be a chat model (i.e. `is_chat_model=True`) and support `extra_body` parameters for guided decoding if a schema is provided. If None, no LLM call is made. | `None` |
| `request_parameters` | `dict[str, Any] \| None` | Additional parameters to pass to the LLM chat call. | `None` |
| `return_reasoning` | `bool` | Whether to return the reasoning done by the model. | `False` |
| `adjust_schema_for_evidence_detection` | `bool` | Whether to adjust the schema to wrap terminal values with metadata. If True, the schema is modified so that each terminal value is replaced with an object containing the original value under the key `content`. | `False` |
| `adjust_schema_description_for_evidence_detection` | `bool` | Whether to adjust the schema description when evidence detection is enabled. If True, the schema description will mention that each value is accompanied by an `evidence_anchor` that is a "verbatim excerpt from the source text supporting the extracted content" (see `METADATA_SCHEMA_WITH_EVIDENCE_SHORTHAND`). Has an effect only if `adjust_schema_for_evidence_detection` is also True. | `False` |
| `evidence_anchor_description` | `str` | Description for the evidence anchor field. | `'Verbatim excerpt from the source text supporting the extracted content.'` |
| `wrapped_content_description` | `str \| None` | Optional description for the `content` field in the metadata wrapper. | `None` |
| `response_has_metadata` | `bool` | If True, the output is expected to have each leaf value wrapped in an object with a `content` key and metadata fields such as `evidence_anchor`. | `False` |
| `augment_metadata_kwargs` | `dict[str, Any] \| None` | Additional keyword arguments for `augment_metadata`. | `None` |
| `**build_messages_kwargs` | `Any` | Additional keyword arguments for `build_chat_messages`. | `{}` |

Returns:

| Type | Description |
|---|---|
| `SingleExtractionResult` | A SingleExtractionResult object with the extraction result. |
Source code in src/kibad_llm/extractors/base.py
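The schema adjustment for evidence detection can be pictured as a recursive transform over a JSON schema: objects and arrays are descended into, and every terminal node is replaced by a wrapper object with `content` and `evidence_anchor` properties. This is a simplified sketch of the idea (it handles only `object`/`array`/terminal `type` keywords, not `anyOf`, `$ref`, etc.), not the actual `kibad_llm` transform:

```python
from typing import Any


def wrap_terminals_sketch(schema: dict[str, Any], *, anchor_description: str) -> dict[str, Any]:
    """Wrap each terminal schema node in an object holding the original value
    under 'content' plus an 'evidence_anchor' string field."""
    t = schema.get("type")
    if t == "object":
        props = {k: wrap_terminals_sketch(v, anchor_description=anchor_description)
                 for k, v in schema.get("properties", {}).items()}
        return {**schema, "properties": props}
    if t == "array":
        return {**schema,
                "items": wrap_terminals_sketch(schema.get("items", {}),
                                               anchor_description=anchor_description)}
    # Terminal value (string, number, boolean, ...): wrap it.
    return {
        "type": "object",
        "properties": {
            "content": schema,
            "evidence_anchor": {"type": "string", "description": anchor_description},
        },
        "required": ["content", "evidence_anchor"],
    }
```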
extract_from_text_lenient(text, text_id, **kwargs)
Wrapper around extract_from_text that catches all exceptions.
This is useful when processing multiple documents and we want to continue processing even if one document fails.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | The text to process. | required |
| `text_id` | `str` | Text identifier for logging. | required |
| `**kwargs` | | Keyword arguments for `extract_from_text`. | `{}` |
Returns: A SingleExtractionResult object with the extraction result or error message.
Source code in src/kibad_llm/extractors/base.py
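The lenient pattern is a try/except wrapper that converts any exception into an error entry in the result instead of aborting a batch run. A generic sketch of the pattern (the real function delegates to `extract_from_text` and returns a `SingleExtractionResult`; here a plain dict and an injected `extract_fn` stand in for both):

```python
def extract_lenient_sketch(text: str, text_id: str, extract_fn, **kwargs) -> dict:
    """Call extract_fn; convert any exception into an error entry instead of raising."""
    try:
        return extract_fn(text, text_id, **kwargs)
    except Exception as e:  # deliberate catch-all: one bad document must not stop the batch
        return {"text_id": text_id, "structured": None,
                "error": f"{type(e).__name__}: {e}"}
```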
strip_metadata(data, *, content_key)
Strip metadata wrappers from a JSON-parsed result produced by wrap_terminals_with_metadata.
The wrapped output encodes terminal values as objects like `{"<content_key>": ..., <metadata fields>}`.
This function walks the parsed JSON (dicts/lists/scalars) and removes such wrappers by replacing the wrapper dict with its `<content_key>` value.

Wrapper detection (heuristic):

- a dict is treated as a wrapper if it has `content_key` AND at least one additional key. (We avoid unwrapping objects that only have `{"<content_key>": ...}`.)

Notes:

- This function does not validate that the "other keys" are truly metadata. If your original extraction schema contains real objects that also have a `content_key` field and other fields, they may be unwrapped unintentionally. If that's a concern, use a more unique `content_key` (e.g. `"__content"`) in the schema wrapping step.
- The input is not mutated; a transformed copy is returned.
Source code in src/kibad_llm/extractors/base.py
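The walk-and-unwrap behavior described above can be sketched in a few lines. This is a simplified stand-in for illustration, using the documented wrapper heuristic; it is not the actual `strip_metadata` source:

```python
from typing import Any


def strip_metadata_sketch(data: Any, *, content_key: str = "content") -> Any:
    """Replace wrapper dicts ({content_key: ..., plus metadata keys}) with their content value."""
    if isinstance(data, dict):
        if content_key in data and len(data) > 1:
            # Wrapper detected: keep only the (recursively stripped) content value.
            return strip_metadata_sketch(data[content_key], content_key=content_key)
        return {k: strip_metadata_sketch(v, content_key=content_key) for k, v in data.items()}
    if isinstance(data, list):
        return [strip_metadata_sketch(v, content_key=content_key) for v in data]
    return data  # scalars pass through unchanged
```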