Comprehensive Guide to Azure AI Document Intelligence for Intelligent Document Processing with Azure Functions
Azure AI Document Intelligence combined with Azure Functions offers a powerful, scalable solution for automating the processing of documents stored in the cloud. This in-depth guide walks you through building an end-to-end intelligent document processing pipeline using Azure Blob Storage, Python, and the Document Intelligence service. You’ll learn practical best practices and see examples that demonstrate how to extract structured table data from uploaded PDFs and transform it into CSV format ready for analytics.
Introduction
In many enterprises, document processing remains a time-consuming and manual task. Forms, invoices, contracts, and reports often contain tables and unstructured data that need to be digitized and analyzed. Azure AI Document Intelligence (formerly Form Recognizer), together with Azure Functions, enables developers to automate this workflow efficiently.
This article covers:
- Setting up Azure Storage for input and output containers
- Creating an Azure Function app triggered by blob uploads
- Calling the Document Intelligence Layout model to extract tables
- Parsing JSON responses into tabular CSV files
- Uploading extracted data back to blob storage
By automating these steps, you gain a scalable cloud-based solution that integrates easily with analytics tools like Microsoft Power BI.
Prerequisites
Before starting, ensure you have the following:
- An active Azure subscription (create one for free)
- A Document Intelligence resource in Azure Cognitive Services, to obtain an API key and endpoint
- Python 3.6.x to 3.9.x installed locally
- Visual Studio Code with Azure Functions, Python extensions, and Azure Functions Core Tools v3
- Azure Storage Explorer installed
- Sample PDF document for testing (e.g., Azure sample layout PDF)
Step 1: Create Azure Storage Account and Containers
Start by provisioning an Azure Storage account with blob containers to handle document input and processed output.
Instructions:
- Navigate to the Azure portal and create a General-purpose v2 Storage Account.
- Choose Standard performance tier.
- In the storage account, create two blob containers: input and output.
- Set the input container’s public access level to Container (anonymous read access).
- Remove any existing CORS policies under the Resource sharing (CORS) tab to avoid cross-origin issues.
This setup enables your Azure Function to monitor input for new documents and deposit processed CSVs into output.
Step 2: Create an Azure Functions Project
Azure Functions allow serverless execution of your processing logic triggered by blob uploads.
Procedure:
- Create a local folder named functions-app for your project.
- Open VS Code and select a Python interpreter (3.6-3.9 recommended).
- From the Azure extension panel, create a new Function:
- Select Create new project in the functions-app folder.
- Choose Python as the language.
- Select Azure Blob Storage trigger.
- Set the trigger to monitor the input/{name} blob path.
- Open the generated __init__.py file, which contains the function triggered on blob upload.
Default trigger example:
import logging
import azure.functions as func

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")
Step 3: Run and Test the Basic Blob Trigger
- Press F5 in VS Code to start the function locally.
- Upload a PDF document to the input container using Azure Storage Explorer.
- Confirm the function logs indicate the blob trigger fired successfully.
This confirms your event-driven function is wired correctly.
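You can also exercise the handler body without the Functions runtime at all. The sketch below is my own addition, not part of the original tutorial: since the trigger function only touches the blob's name, length, and read() method, a hypothetical FakeBlob stand-in is enough for a quick local unit test.

```python
import logging

# Minimal stand-in for azure.functions.InputStream: the handler only
# needs .name, .length and .read(), so a plain object suffices locally.
class FakeBlob:
    def __init__(self, name, data):
        self.name = name
        self.length = len(data)
        self._data = data

    def read(self):
        return self._data

# Mirrors the default trigger body from Step 2.
def main(myblob):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")

blob = FakeBlob("input/sample.pdf", b"%PDF-1.4 dummy bytes")
main(blob)
```

This keeps the feedback loop fast while you iterate on the parsing logic in later steps.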
Step 4: Add Document Intelligence Processing Logic
Enhance your function to call the Azure Document Intelligence Layout API, which analyzes and extracts tables and layout data from PDFs.
Update Dependencies
Add these packages to your requirements.txt:
cryptography
azure-functions
azure-storage-blob
azure-identity
requests
pandas
numpy
Import Necessary Libraries
Modify __init__.py imports:
import logging
from azure.storage.blob import BlobServiceClient
import azure.functions as func
import json
import time
import os
import requests
import pandas as pd
import numpy as np
Implement API Call
Inside the existing main function, insert the following snippet to call the Document Intelligence Layout model:
# Configure your endpoint and key
endpoint = r"Your Document Intelligence Endpoint"
apim_key = "Your Document Intelligence Key"
post_url = endpoint + "/formrecognizer/v2.1/layout/analyze"
# Read blob content
source = myblob.read()
headers = {
'Content-Type': 'application/pdf',
'Ocp-Apim-Subscription-Key': apim_key,
}
# Submit document for analysis
resp = requests.post(url=post_url, data=source, headers=headers)
if resp.status_code != 202:
logging.error(f"POST analyze failed: {resp.text}")
return
get_url = resp.headers["operation-location"]
# Wait for async analysis to complete
wait_sec = 25
logging.info(f"Waiting {wait_sec} seconds for analysis completion...")
time.sleep(wait_sec)
resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": apim_key})
resp_json = resp.json()
if resp_json.get("status") != "succeeded":
logging.error(f"GET Layout results failed: {resp.text}")
return
results = resp_json
Security tip: Never hardcode keys in production. Use Azure Key Vault or environment variables.
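In keeping with that tip, the endpoint and key from Step 4 can be read from environment variables instead. A minimal sketch, assuming hypothetical variable names DOCINTEL_ENDPOINT and DOCINTEL_KEY (the placeholder values below exist only so the snippet runs on its own):

```python
import os

def load_docintel_config():
    # Variable names are illustrative, not an Azure convention.
    endpoint = os.environ.get("DOCINTEL_ENDPOINT", "")
    apim_key = os.environ.get("DOCINTEL_KEY", "")
    if not endpoint or not apim_key:
        raise RuntimeError("Set DOCINTEL_ENDPOINT and DOCINTEL_KEY first.")
    return endpoint, apim_key

# Placeholders so the example is self-contained; in a deployed Function
# App these values come from application settings instead.
os.environ.setdefault("DOCINTEL_ENDPOINT", "https://example.cognitiveservices.azure.com")
os.environ.setdefault("DOCINTEL_KEY", "placeholder-key")

endpoint, apim_key = load_docintel_config()
post_url = endpoint + "/formrecognizer/v2.1/layout/analyze"
```

In Azure, set these under the Function App's Configuration blade, ideally as Key Vault references so the secret never lives in app settings directly.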
Step 5: Parse and Extract Table Data
The response JSON contains details about pages, tables, and cells. We’ll convert this into a pandas DataFrame and then export it as a CSV.
pages = results["analyzeResult"]["pageResults"]
# Function to process a single page
def make_page(p):
    res = []
    res_table = []
    page = pages[p]
    y = 0  # running table number within the page
    for tab in page["tables"]:
        for cell in tab["cells"]:
            res.append(cell)
            res_table.append(y)
        y += 1
    res_table_df = pd.DataFrame(res_table)
    res_df = pd.DataFrame(res)
    res_df["table_num"] = res_table_df[0]
    h = res_df.drop(columns=["boundingBox", "elements"])
    h["rownum"] = range(len(h))
    num_table = h["table_num"].max()
    return h, num_table, p
# Process first page
h, num_table, p = make_page(0)
# Loop through each table and reconstruct DataFrame
for k in range(num_table + 1):
    new_table = h[h.table_num == k].copy()
    new_table["rownum"] = range(len(new_table))
    row_table = pages[p]["tables"][k]["rows"]
    col_table = pages[p]["tables"][k]["columns"]
    b = pd.DataFrame(np.zeros((row_table, col_table)))
    s = 0
    for i, j in zip(new_table["rowIndex"], new_table["columnIndex"]):
        b.iat[i, j] = new_table.iloc[s]["text"]
        s += 1
    # b now holds the table data for table k
This generalized code extracts tables but may need tailoring based on your document’s structure.
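To make the cell-to-grid step concrete, here is a self-contained sketch of the same reconstruction using plain Python lists instead of pandas. The response fragment is hand-made for illustration (fields trimmed to what the rebuild needs), not real API output:

```python
# A fabricated fragment shaped like one entry of the v2.1 Layout
# "pageResults" payload, reduced to rows/columns/cells.
page = {
    "tables": [{
        "rows": 2,
        "columns": 2,
        "cells": [
            {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
            {"rowIndex": 0, "columnIndex": 1, "text": "Qty"},
            {"rowIndex": 1, "columnIndex": 0, "text": "Widget"},
            {"rowIndex": 1, "columnIndex": 1, "text": "3"},
        ],
    }]
}

def table_to_grid(tab):
    # Pre-size the grid, then drop each cell's text at its row/column slot.
    grid = [["" for _ in range(tab["columns"])] for _ in range(tab["rows"])]
    for cell in tab["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    return grid

grid = table_to_grid(page["tables"][0])
print(grid)  # [['Item', 'Qty'], ['Widget', '3']]
```

Cells spanning multiple rows or columns carry rowSpan/columnSpan fields in the real payload, which this simple version ignores.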
Step 6: Upload CSV Results Back to Blob Storage
Finally, connect to your Azure Storage output container and upload the extracted CSV.
# Connect to blob storage
blob_service_client = BlobServiceClient.from_connection_string(
"DefaultEndpointsProtocol=https;AccountName=YourStorageAccountName;"
"AccountKey=YourStorageAccountKey;EndpointSuffix=core.windows.net"
)
container_client = blob_service_client.get_container_client("output")
# Convert DataFrame to CSV (to_csv returns a string when no path is given)
csv_data = b.to_csv(header=False, index=False)
# Define filename based on input PDF name
csv_name = os.path.splitext(os.path.basename(myblob.name))[0] + ".csv"
# Upload CSV blob
container_client.upload_blob(name=csv_name, data=csv_data, overwrite=True)
logging.info(f"Uploaded {csv_name} to output container.")
Note: Use overwrite=True to allow multiple uploads of the same file name during testing.
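The filename and CSV-serialization steps can be checked in isolation before touching storage. This sketch mirrors them with the standard library only (the csv module standing in for pandas), on a small hypothetical grid:

```python
import csv
import io
import os

def grid_to_csv(grid):
    # Serialize a list-of-lists table to a CSV string with no header and
    # no index, matching DataFrame.to_csv(header=False, index=False).
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()

def csv_name_for(blob_name):
    # "input/sample.pdf" -> "sample.csv"
    return os.path.splitext(os.path.basename(blob_name))[0] + ".csv"

csv_data = grid_to_csv([["Item", "Qty"], ["Widget", "3"]])
print(csv_name_for("input/sample.pdf"))  # sample.csv
```

Keeping these as small pure functions also makes the Function body easier to unit-test.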
Step 7: Run the Complete Function and Validate
- Restart your function with F5.
- Upload a PDF to the input container.
- Monitor logs for processing status.
- Check the output container for the resulting CSV file.
You can now connect this output container to Power BI or other analytics tools to create rich visualizations from the extracted table data.
Best Practices and Tips
- Secure Credentials: Avoid embedding keys in source code. Use Azure Key Vault or environment variables.
- Handle Large Files: For bigger documents, increase the wait time or implement polling logic with retries.
- Error Handling: Add comprehensive exception handling and logging to troubleshoot issues.
- Customize Parsing: The sample parsing logic assumes a simple table structure; adapt it for complex layouts or multi-page documents.
- Scaling: Azure Functions scale automatically; consider function timeout limits and optimize processing accordingly.
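On the polling point: the fixed 25-second sleep in Step 4 is fragile for large documents. Below is a sketch of a retry loop with an injectable fetch callable so it can be tested without the real API; the function name, attempt count, and delay are illustrative choices, not from the original tutorial.

```python
import time

def poll_for_result(fetch, max_attempts=10, delay_sec=2):
    # fetch() should return the parsed JSON body of a GET against the
    # operation-location URL; loop until the analysis finishes.
    for _ in range(max_attempts):
        body = fetch()
        status = body.get("status")
        if status == "succeeded":
            return body
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {body}")
        time.sleep(delay_sec)
    raise TimeoutError("Analysis did not complete in time")

# Stubbed fetch: pretends the service reports 'running' twice, then done.
responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"pageResults": []}},
])
result = poll_for_result(lambda: next(responses), delay_sec=0)
print(result["status"])  # succeeded
```

In the function itself, fetch would wrap the requests.get call against get_url with the subscription-key header.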
Conclusion
This comprehensive tutorial demonstrated how to leverage the Azure AI Document Intelligence service in conjunction with Azure Functions to create an intelligent, automated document processing pipeline. By extracting tabular data from PDFs uploaded to Azure Blob Storage and converting it into CSV files, you enable seamless integration with analytics platforms like Power BI.
This architecture reduces manual data entry, improves accuracy, and accelerates business insights from unstructured documents.
Further Reading
- Azure Document Intelligence Overview
- Azure Functions Blob Trigger
- Azure Storage Blob SDK for Python
- Power BI Integration with Azure Blob Storage
Unlock the full potential of your documents today by building your own intelligent document processing pipeline with Azure AI Document Intelligence and Azure Functions!