Document intelligence evolved: Building and evaluating KIE solutions that scale

Source: AWS ML Blog, https://aws.amazon.com/blogs/machine-learning/document-intelligence-evolved-building-and-evaluating-kie-solutions-that-scale/

TL;DR

  • This article demonstrates an end-to-end approach to building and evaluating a KIE (key information extraction) solution using Amazon Nova models through Amazon Bedrock.
  • It covers three phases: data readiness, solution development, and performance measurement, with a practical case study based on the FATURA invoice dataset.
  • A model-agnostic prompting strategy is highlighted, including templating with Jinja2 and the use of the Amazon Bedrock Converse API for unified model interaction.
  • Evaluation balances technical accuracy (F1-score) with business value (processing latency and cost per document).

Context and background

Intelligent document processing (IDP) refers to the automated extraction, classification, and processing of data from structured and unstructured document formats. Within IDP, key information extraction (KIE) enables systems to identify and extract critical data points with minimal human intervention. Organizations across sectors such as financial services, healthcare, legal, and supply chain management increasingly rely on IDP to reduce manual data entry and accelerate business processes. As document volumes grow, IDP solutions enable sophisticated agentic workflows in which AI systems analyze extracted data and take actions with little human input. Processing invoices, contracts, medical records, and regulatory documents accurately has become a business necessity, not just a competitive advantage.

Developing effective IDP solutions requires robust extraction capabilities and evaluation frameworks tailored to industry needs and specific use cases. The post demonstrates an end-to-end approach for building and evaluating a KIE solution using Amazon Nova models available through Amazon Bedrock, organized into three critical phases: data readiness (understanding and preparing documents), solution development (implementing extraction logic with appropriate models), and performance measurement (evaluating accuracy, efficiency, and cost-effectiveness). It also discusses how to select, implement, and evaluate foundation models for document processing while weighing extraction accuracy, processing speed, and cost. For practitioners, from data scientists to developers and business analysts, the guide offers insights into using large language models for document extraction tasks and establishing meaningful metrics for decision-making.

The FATURA dataset serves as a practical proxy for real-world enterprise data, providing realistic document processing scenarios and reliable ground truth for evaluation. It contains 10,000 invoices with 50 distinct layouts and ground-truth annotations for 24 fields per document; 40 samples were drawn from each of 49 layouts, for a total of 1,960 evaluation samples. Ground-truth variations were standardized to ensure fair evaluation across layouts. Model-agnostic prompting approaches and a robust evaluation framework complete the picture, guiding practitioners to balance technical performance with business value and move toward scalable, accurate, and efficient document processing solutions.

What’s new

The post showcases a practical, end-to-end KIE workflow that leverages Amazon Bedrock’s foundation models (Nova family) via the Converse API. Key innovations include:

  • A streamlined interface via the Converse API that abstracts model-specific formatting, enabling rapid experimentation and model comparisons for document extraction tasks.
  • Model-agnostic prompting strategies implemented with templating frameworks (e.g., Jinja2) to maintain a single prompt structure while incorporating rule-based logic across various extraction scenarios.
  • A robust treatment of real-world data challenges, including missing fields, multiple values for a single field (e.g., multiple phone numbers), and fields that can be structured or unstructured (addresses), as well as value hierarchies (tax amounts dependent on subtotals).
  • A practical emphasis on input modalities, including support for text, images, or multimodal inputs, with a unified composite input structure that simplifies handling multiple information sources in a single request.
  • Guidance on data readiness and ground-truth standardization. The FATURA dataset is used to illustrate how to align ground truth with an LLM-based extraction output, including normalizing prefixes and field representations to ensure fair evaluation.
  • The use of Jinja2 templates loaded via LangChain PromptTemplate to populate prompts with document-specific data, including OCR text and field descriptions, and to generate final prompts for the LLM.
  • A focus on robust evaluation beyond simple accuracy, using the F1-score to balance precision and recall, and incorporating business considerations such as processing latency and cost per document.

Why it matters (impact for developers/enterprises)

For developers and data scientists, the post clarifies how to experiment with foundation models for document processing in a model-agnostic way, reducing the need for bespoke, rule-based systems. The Converse API simplifies model interaction and accelerates iteration across different models, helping teams compare extraction quality, speed, and cost under realistic workloads.

For enterprises, the approach provides a blueprint for evaluating KIE solutions against business objectives. By using a realistic dataset like FATURA and emphasizing metrics that reflect operational value (precision, recall, F1, latency, and per-document cost), organizations can choose models and configurations that balance accuracy with throughput and budget. The emphasis on handling incomplete data, multi-valued fields, and mixed input modalities mirrors real-world scenarios where data quality varies across documents and layouts. The broader takeaway is a path to scalable, accurate, and cost-aware document processing that integrates into enterprise data pipelines and agentic workflows, reducing manual intervention while maintaining robust evaluation practices.

Technical details or Implementation

The implementation revolves around a three-phase pipeline and a suite of practical techniques designed to produce scalable KIE solutions:

  • Data readiness and ground truth normalization: The FATURA dataset comprises 10,000 invoices across 50 layouts, with 24 fields per document. To enable fair evaluation, the team normalized ground-truth annotations to address structural inconsistencies (nested vs. flat representations) and value format inconsistencies (e.g., prefixed fields like "INVOICE DATE: 01/15/2023"). A sample of 40 documents from each of 49 layouts provided 1,960 evaluation samples, with an imbalanced distribution of fields (roughly 250 to 1,800 instances across 18 fields). The real-world challenge of missing fields is acknowledged and incorporated into the evaluation.
  • Model interaction via the Converse API: The Amazon Bedrock Converse API provides a unified interface for invoking foundation models, removing the complexity of model-specific formatting. Core parameters include model_id and messages containing prompts and context. This architecture facilitates experiments across different models and speeds iteration for document extraction workflows; a minimal Converse call is sketched after this list.
  • Prompts and templating: Effective information extraction relies on consistent, model-agnostic prompts. Templating frameworks like Jinja2 make it possible to maintain a single prompt structure while incorporating rule-based logic. A Jinja2-based KIE template is populated with document-specific data (e.g., OCR text, field descriptions) through a LangChain PromptTemplate, which loads the Jinja2 template and fills a dictionary of variables to generate the final prompt; a simplified templating sketch follows this list.
  • Multimodal input handling: To accommodate multiple input modalities in a single request, a content array is constructed with one entry per image (encoded appropriately) and a separate entry for the text prompt. This unified input structure supports text-only, image-only, or multimodal extraction tasks without bespoke per-modality handling logic; a composite-input sketch appears after this list.
  • Image processing: The image_to_bytes utility converts document images into a model-friendly format, with potential model-specific resizing optimizations to enhance performance.
  • Evaluation framework: Beyond accuracy, the evaluation includes precision and recall, with field-specific comparators designed to determine when an extraction is a true positive, false positive, or false negative. The designers emphasize the need to account for variations in dates, currencies, and formatted values, ensuring a robust measure of extraction quality. Additionally, practical business metrics such as latency and cost per document are integrated into the evaluation to reflect real-world constraints.
  • Key table of dataset characteristics (illustrative overview):

    | Dataset characteristic | Description |
    | --- | --- |
    | FATURA invoices | 10,000 invoices across 50 layouts |
    | Layout variety | 50 distinct layouts; 24 fields per document |
    | Ground-truth samples used | 1,960 (40 documents from 49 layouts) |
    | Field distribution | 18 fields with imbalanced occurrences (approx. 250 to 1,800 per field) |
    | Ground-truth normalization | Removed inconsistent prefixes and aligned with LLM output structure |
  • Evaluation metric and computation: The F1-score is used to balance precision and recall for field extractions. True positives, false positives, and false negatives are determined by whether extracted values match the standardized ground truth, accounting for textual and numeric formatting variations. The approach acknowledges that some fields may be more critical than others, and that overall performance must reflect both extraction accuracy and business impact; a minimal F1 computation is sketched after this list.
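
Below is a minimal sketch of a text-only Converse call using boto3. The model ID, inference settings, and function name are illustrative rather than taken from the post; the prompt argument is assumed to hold an already-populated KIE prompt.

```python
import boto3

# Bedrock Runtime exposes the Converse API for all supported foundation models.
client = boto3.client("bedrock-runtime")

def extract_fields(prompt: str, model_id: str = "amazon.nova-lite-v1:0") -> str:
    """Send a populated KIE prompt to a Bedrock model via the Converse API."""
    response = client.converse(
        modelId=model_id,  # swapping models only requires changing this ID
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 1024},
    )
    # The model's reply is the first content block of the output message.
    return response["output"]["message"]["content"][0]["text"]
```

Because the request shape is identical across models, comparing Nova variants (or any other Bedrock model) reduces to changing the model ID.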
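
A simplified templating sketch follows, assuming a trimmed-down template and field dictionary; the post's actual KIE template is richer and encodes additional rule-based logic for optional and multi-valued fields.

```python
from langchain_core.prompts import PromptTemplate

# Trimmed-down stand-in for the post's Jinja2 KIE template.
KIE_TEMPLATE = """Extract the following fields from the invoice below.
{% for name, description in fields.items() %}
- {{ name }}: {{ description }}
{% endfor %}
Invoice OCR text:
{{ ocr_text }}

Return a JSON object with one key per field; use null for missing fields."""

prompt_template = PromptTemplate.from_template(KIE_TEMPLATE, template_format="jinja2")

final_prompt = prompt_template.format(
    fields={
        "invoice_date": "The date the invoice was issued",
        "total_amount": "The grand total, including tax",
    },
    ocr_text="INVOICE DATE: 01/15/2023 ...",  # OCR output for one document
)
```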
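
A composite-input sketch for multimodal requests is shown next; image_to_bytes here is a hypothetical stand-in for the post's utility, and the model ID is illustrative.

```python
import boto3
from pathlib import Path

client = boto3.client("bedrock-runtime")

def image_to_bytes(path: str) -> bytes:
    # Hypothetical stand-in for the post's image_to_bytes utility; a real
    # version might also apply model-specific resizing before encoding.
    return Path(path).read_bytes()

# One content entry per image, plus a single entry for the text prompt.
content = [
    {"image": {"format": "png", "source": {"bytes": image_to_bytes("invoice_page_1.png")}}},
    {"text": final_prompt},  # the templated KIE prompt from the previous sketch
]

response = client.converse(
    modelId="amazon.nova-pro-v1:0",  # illustrative multimodal-capable model
    messages=[{"role": "user", "content": content}],
)
```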
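
Finally, a minimal field-level F1 sketch. The normalization below (case and whitespace only) is a simplified stand-in for the post's field-specific comparators, which also handle date, currency, and numeric formatting variations; None is assumed to mark a missing field.

```python
def normalize(value):
    """Simplified normalization; real comparators would also canonicalize
    dates, currencies, and other formatted values."""
    return " ".join(str(value).lower().split()) if value is not None else None

def field_f1(predictions, ground_truths):
    """Compute F1 for one field across documents; None means 'field absent'."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, ground_truths):
        pred, gold = normalize(pred), normalize(gold)
        if pred is None and gold is None:
            continue   # correctly absent: no effect on precision or recall
        elif gold is None:
            fp += 1    # extracted a value the document does not contain
        elif pred is None:
            fn += 1    # missed a value present in the ground truth
        elif pred == gold:
            tp += 1    # match after normalization
        else:
            fp += 1
            fn += 1    # wrong value counts as both spurious and missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```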

Key takeaways

  • End-to-end KIE pipelines can be built using foundation models via Bedrock, with a unified API that accelerates experimentation and model comparison.
  • Ground-truth standardization is essential to fair evaluation in real-world document processing scenarios.
  • A multimodal input approach and robust prompting strategies enable flexible extraction across varied document types and layouts.
  • Evaluation must balance technical metrics (precision, recall, F1) with business considerations (latency, cost per document).
  • The FATURA dataset provides a realistic proxy for enterprise invoices, highlighting common data challenges such as missing fields, multi-value fields, and hierarchical data (e.g., tax amounts).

FAQ

  • What is the role of the FATURA dataset in this study?

    FATURA provides a representative set of 10,000 invoices across 50 layouts with 24 fields per document, used to illustrate ground-truth standardization, sampling, and evaluation workflows for KIE.

  • Why use the Converse API in Bedrock for KIE tasks?

    The Converse API offers a streamlined, model-agnostic interface for interacting with foundation models, enabling faster experimentation and easier model comparisons for document extraction workflows.

  • How is extraction quality measured for KIE in this context?

    Extraction quality is evaluated using the F1-score, which balances precision and recall. Field-level comparators handle variations in formats and representations, ensuring a fair assessment of ground-truth alignment.

  • What practical factors beyond accuracy matter in evaluation?

    Latency (processing speed) and cost per document are included in the evaluation to reflect enterprise constraints and deployment considerations.
