Implementing a Snowflake Document AI Architecture for Processing Documents at Scale

Businesses today process millions of documents annually, but managing unstructured data like invoices, forms, and logs can be a bottleneck. Snowflake Document AI offers a scalable solution to simplify and automate this process.

Streamline Workflows with Snowflake Document AI

Snowflake Document AI combines Snowflake's cloud data platform with machine learning to automate document processing. It transforms unstructured files (e.g., PDFs, forms, invoices) into structured data, enabling organizations to streamline workflows and derive meaningful insights.

The workflow includes:

Unstructured Data Conversion: Transform unstructured files, such as JPGs and PDFs, into structured tables for analysis.
Continuous Document Processing: Build pipelines that automatically process new documents and extract insights in near real time.
Domain-Specific Model Training: Train models to handle specialized document types, such as medical records or financial statements, tailored to business needs.

How Snowflake Document AI Works

Snowflake Document AI operates through three key components that simplify document processing and make it possible to automate the workflow end to end:

Snowsight UI: A user-friendly interface for managing document uploads, defining models, and validating extracted data. Snowsight simplifies the process for users by providing a visual approach to document processing.
SQL Integration: Each model build exposes a SQL method, <model_build_name>!PREDICT, that lets users process documents programmatically. The extracted data can then be transformed and integrated into the rest of the Snowflake data ecosystem.
Streams and Tasks: These Snowflake features enable automated, near-real-time pipelines. Streams record changes in stages or tables (e.g., new document uploads), while Tasks run the SQL that processes the new data and stores the results.

Together, these components form a robust system for automating document processing with Snowflake. The following diagrams illustrate how these components work together.
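The interplay between Streams, Tasks, and the PREDICT method can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the tutorial's exact code: DOC_STAGE, DOC_STREAM, DOC_RESULTS, DOC_TASK, and <warehouse_name> are hypothetical names, while the schema, model build name, and version 3 are borrowed from the query shown later in this article. It assumes the stage already exists with its directory table enabled and that the model build has been published.

```sql
-- Sketch only: DOC_STAGE, DOC_STREAM, DOC_RESULTS, DOC_TASK and <warehouse_name>
-- are hypothetical names chosen for illustration.
USE SCHEMA DB_DOCUMENTS.DOCUMENTS;

-- A stream on the stage's directory table records newly registered files.
CREATE STREAM IF NOT EXISTS DOC_STREAM ON STAGE DOC_STAGE;

-- Target table for the raw extraction results.
CREATE TABLE IF NOT EXISTS DOC_RESULTS (
    FILE_NAME     STRING,
    LAST_MODIFIED TIMESTAMP_LTZ,
    JSON_CONTENT  VARIANT
);

-- The task checks once a minute and runs only when the stream holds new rows.
CREATE TASK IF NOT EXISTS DOC_TASK
    WAREHOUSE = <warehouse_name>
    SCHEDULE = '1 MINUTE'
    WHEN SYSTEM$STREAM_HAS_DATA('DOC_STREAM')
AS
    INSERT INTO DOC_RESULTS (FILE_NAME, LAST_MODIFIED, JSON_CONTENT)
    SELECT
        RELATIVE_PATH,
        LAST_MODIFIED,
        IC_DOCUMENT_AI_BUILD_1!PREDICT(
            GET_PRESIGNED_URL(@DOC_STAGE, RELATIVE_PATH), 3)
    FROM DOC_STREAM
    WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created in a suspended state; resume to start the schedule.
ALTER TASK DOC_TASK RESUME;
```

Because the stream reads from the stage's directory table, newly uploaded files only appear once the directory metadata has been refreshed (for example with ALTER STAGE DOC_STAGE REFRESH).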

Architecture

Prerequisites for Implementing the Architecture

STOP!

Before proceeding to the implementation steps in the next section, which walk through each of the architectural components above, make sure you have completed the preparatory steps outlined in the Snowflake Document AI Build: Step By Step Tutorial.

Steps 1–11 (Pre "Document AI Build" SQL) in this tutorial are absolute prerequisites.

They cover the setup of:

Roles and privileges.
Schemas and internal stages.
Other foundational configurations required for this implementation.

These steps ensure a successful Snowflake Document AI Architecture implementation by establishing the necessary environment for workflows, automation, and data processing.

Steps to Implement Snowflake Document AI

The following six steps guide you through the Snowflake Document AI Architecture implementation, from setting up roles and privileges to training models and storing extracted data. They correspond to steps 1 to 6 in the architectural diagram above.

Define a role for Document AI

Create a dedicated role, IC_DOCUMENT_AI_MANAGER, to manage document processing workflows.
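A minimal sketch of this step is shown below. It assumes you are running as a role that is allowed to create roles (ACCOUNTADMIN is used here only as an example), and <user_name> is a placeholder for whoever will work with Document AI.

```sql
-- Assumption: the current user can use ACCOUNTADMIN (or another role
-- with the CREATE ROLE privilege).
USE ROLE ACCOUNTADMIN;

-- Dedicated role for Document AI work.
CREATE ROLE IF NOT EXISTS IC_DOCUMENT_AI_MANAGER;

-- <user_name> is a placeholder; replace it with a real user.
GRANT ROLE IC_DOCUMENT_AI_MANAGER TO USER <user_name>;
```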

Assign Privileges to IC_DOCUMENT_AI_MANAGER

Grant the SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR database role to the IC_DOCUMENT_AI_MANAGER role.

This grant enables the role to:

Create Document AI model builds.
Access and process unstructured documents for data extraction.
Extract structured data using SQL.
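In SQL, this grant could look like the sketch below. SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR is a database role in the shared SNOWFLAKE database, so it is granted with GRANT DATABASE ROLE; running it as ACCOUNTADMIN is an assumption.

```sql
-- Assumption: executed by ACCOUNTADMIN or an equivalently privileged role.
USE ROLE ACCOUNTADMIN;

GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR
    TO ROLE IC_DOCUMENT_AI_MANAGER;
```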

Enable Model Building Privileges

Grant the CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE privilege on the target schema to the IC_DOCUMENT_AI_MANAGER role. This privilege allows the role to create Document AI model builds, which are based on the Arctic-TILT model.
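A sketch of the corresponding grants follows. It assumes the DB_DOCUMENTS database and DOCUMENTS schema (the names used in the query later in this article) already exist; if you create the schema in the next step, run these grants afterwards. The CREATE MODEL grant is included because Snowflake's setup instructions list it alongside the model build privilege.

```sql
-- Assumption: DB_DOCUMENTS.DOCUMENTS already exists; adjust to your names.
GRANT USAGE ON DATABASE DB_DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
GRANT USAGE ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;

-- Allows the role to create Document AI model builds in this schema.
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE
    ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
GRANT CREATE MODEL
    ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;
```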

Set Up Schema and Configure Internal Stage

To organize and manage documents for processing, use an existing database, such as RAW or STAGE, and create a schema named DOCUMENTS. This schema will include an internal stage that holds all documents to be processed by Document AI.
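The schema and stage could be created along these lines. DB_DOCUMENTS is taken from the query later in this article and stands in for whichever existing database you choose; DOC_STAGE is a hypothetical stage name. The directory table is enabled so that files can be listed and tracked, and server-side encryption is set because Document AI requires it for internal stages.

```sql
-- DB_DOCUMENTS stands in for your existing database; DOC_STAGE is hypothetical.
CREATE SCHEMA IF NOT EXISTS DB_DOCUMENTS.DOCUMENTS;

CREATE STAGE IF NOT EXISTS DB_DOCUMENTS.DOCUMENTS.DOC_STAGE
    DIRECTORY = (ENABLE = TRUE)             -- lets DIRECTORY(@stage) and streams see files
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');  -- server-side encryption for Document AI
```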

The IC_DOCUMENT_AI_MANAGER role must also have the necessary privileges to create streams and tasks to automate file processing as soon as new documents are loaded into the stage.
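One way to grant these automation privileges is sketched below. The schema name comes from the query later in this article, and <warehouse_name> is a placeholder for the warehouse the tasks will run on.

```sql
-- Object-creation privileges needed for the pipeline objects.
GRANT CREATE STREAM, CREATE TASK, CREATE TABLE
    ON SCHEMA DB_DOCUMENTS.DOCUMENTS TO ROLE IC_DOCUMENT_AI_MANAGER;

-- Required for the role to run tasks it owns.
GRANT EXECUTE TASK ON ACCOUNT TO ROLE IC_DOCUMENT_AI_MANAGER;

-- <warehouse_name> is a placeholder for the warehouse used by the tasks.
GRANT USAGE, OPERATE ON WAREHOUSE <warehouse_name>
    TO ROLE IC_DOCUMENT_AI_MANAGER;
```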

Build and Train Document AI Model

Use the Document AI Build feature to create a model build based on the Arctic-TILT LLM. The model reads documents and extracts structured information. You can improve its accuracy through training: review and correct the extracted answers, then fine-tune the model on those documents to improve its predictions.

Note: For detailed instructions and SQL commands to create and configure the Document AI Build, refer to indigoChart's tutorial: How to Build and Train Snowflake Document AI Models.

Store Extracted Information

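Run the following SQL query to extract information from every file in the stage, which is particularly useful when processing multiple documents. The second argument to PREDICT (3 here) is the model build version. To store the results, wrap the query in a CREATE TABLE ... AS SELECT or INSERT INTO statement.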

SQL:

SELECT DB_DOCUMENTS.DOCUMENTS.IC_DOCUMENT_AI_BUILD_1!PREDICT(
    GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), 3) AS JSON_CONTENT
FROM DIRECTORY(@<stage_name>);
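The query above returns the extraction results without saving them. A minimal sketch of persisting them is to wrap the same SELECT in a CREATE TABLE ... AS statement; the table name DOC_RESULTS_BATCH and the extra file metadata columns are assumptions added for illustration.

```sql
-- DOC_RESULTS_BATCH is a hypothetical table name; <stage_name> stays a placeholder.
CREATE OR REPLACE TABLE DB_DOCUMENTS.DOCUMENTS.DOC_RESULTS_BATCH AS
SELECT
    RELATIVE_PATH AS FILE_NAME,       -- file path within the stage
    LAST_MODIFIED,                    -- from the stage's directory table
    DB_DOCUMENTS.DOCUMENTS.IC_DOCUMENT_AI_BUILD_1!PREDICT(
        GET_PRESIGNED_URL(@<stage_name>, RELATIVE_PATH), 3) AS JSON_CONTENT
FROM DIRECTORY(@<stage_name>);
```

Keeping the file name next to the JSON output makes it possible to trace each extracted record back to its source document.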

Final Steps

By completing these six steps, you’ve laid the foundation for automating document processing with Snowflake Document AI. To finalize your pipeline, refer to Steps 12–14 within Section 3 of the tutorial: How to Build and Train Snowflake Document AI Models.

These steps include:

● Processing extracted JSON data.

● Configuring dynamic tables to store structured outputs.

● Ensuring your pipeline is ready for production use.
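As a rough idea of what the JSON processing and dynamic table steps can look like (the tutorial remains the reference), the sketch below flattens the PREDICT output into columns. It assumes the raw results sit in a hypothetical table DOC_RESULTS with FILE_NAME and JSON_CONTENT columns, and that the model build defines a value named invoice_number; both are illustrative. Each extracted value in the output is an array of objects carrying a score and a value, alongside an OCR score in __documentMetadata.

```sql
-- Hypothetical names: DOC_RESULTS, DOC_RESULTS_STRUCTURED, invoice_number,
-- and <warehouse_name>. Adjust the field names to the values your build defines.
CREATE OR REPLACE DYNAMIC TABLE DB_DOCUMENTS.DOCUMENTS.DOC_RESULTS_STRUCTURED
    TARGET_LAG = '1 hour'
    WAREHOUSE = <warehouse_name>
AS
SELECT
    FILE_NAME,
    JSON_CONTENT:__documentMetadata.ocrScore::FLOAT AS OCR_SCORE,
    JSON_CONTENT:invoice_number[0].value::STRING    AS INVOICE_NUMBER,
    JSON_CONTENT:invoice_number[0].score::FLOAT     AS INVOICE_NUMBER_SCORE
FROM DB_DOCUMENTS.DOCUMENTS.DOC_RESULTS;
```

Keeping the confidence score beside each value makes it easy to route low-confidence extractions to manual review.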

Completing these steps unlocks the full potential of Snowflake Document AI, providing a scalable, efficient, and production-ready solution for document-heavy workflows.

Visit us at www.indigoChart.com or drop us a line at hello@indigochart.com
