Comprehensive Guide to Azure AI Document Intelligence for Intelligent Document Processing with Azure Functions
Azure AI Document Intelligence combined with Azure Functions offers a powerful, scalable solution for automating the processing of documents stored in the cloud. This in-depth guide walks you through building an end-to-end intelligent document processing pipeline using Azure Blob Storage, Python, and the Document Intelligence service. You’ll learn practical best practices and see examples that demonstrate how to extract structured table data from uploaded PDFs and transform it into CSV format ready for analytics.
Introduction
In many enterprises, document processing remains a time-consuming and manual task. Forms, invoices, contracts, and reports often contain tables and unstructured data that need to be digitized and analyzed. Azure AI Document Intelligence (formerly Form Recognizer), together with Azure Functions, enables developers to automate this workflow efficiently.
This article covers:
- Setting up Azure Storage for input and output containers
- Creating an Azure Function app triggered by blob uploads
- Calling the Document Intelligence Layout model to extract tables
- Parsing JSON responses into tabular CSV files
- Uploading extracted data back to blob storage
By automating these steps, you gain a scalable cloud-based solution that integrates easily with analytics tools like Microsoft Power BI.
Prerequisites
Before starting, ensure you have the following:
- An active Azure subscription (create one for free)
- A Document Intelligence resource in Azure Cognitive Services, to obtain an API key and endpoint
- Python 3.6.x to 3.9.x installed locally
- Visual Studio Code with Azure Functions, Python extensions, and Azure Functions Core Tools v3
- Azure Storage Explorer installed
- Sample PDF document for testing (e.g., Azure sample layout PDF)
Step 1: Create Azure Storage Account and Containers
Start by provisioning an Azure Storage account with blob containers to handle document input and processed output.
Instructions:
- Navigate to the Azure portal and create a General-purpose v2 Storage Account.
- Choose Standard performance tier.
- In the storage account, create two blob containers: input and output.
- Set the input container’s public access level to Container (anonymous read access).
- Remove any existing CORS policies under the Resource sharing (CORS) tab to avoid cross-origin issues.
This setup enables your Azure Function to monitor input for new documents and deposit processed CSVs into output.
Step 2: Create an Azure Functions Project
Azure Functions allow serverless execution of your processing logic triggered by blob uploads.
Procedure:
- Create a local folder named functions-app for your project.
- Open VS Code and select a Python interpreter (3.6-3.9 recommended).
- From the Azure extension panel, create a new Function:
- Select Create new project in the functions-app folder.
- Choose Python as the language.
- Select Azure Blob Storage trigger.
- Set the trigger to monitor the input/{name} blob path.
- Open the generated __init__.py file, which contains the function triggered on blob upload.
Default trigger example:
import logging
import azure.functions as func

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")
Step 3: Run and Test the Basic Blob Trigger
- Press F5 in VS Code to start the function locally.
- Upload a PDF document to the input container using Azure Storage Explorer.
- Confirm the function logs indicate the blob trigger fired successfully.
This confirms your event-driven function is wired correctly.
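You can also exercise the handler body without the Functions runtime at all. The sketch below is my own addition, not part of the original tutorial: since the trigger function only touches the blob's name, length, and read() method, a hypothetical FakeBlob stand-in is enough for a quick local unit test.

```python
import logging

# Minimal stand-in for azure.functions.InputStream: the handler only
# needs .name, .length and .read(), so a plain object suffices locally.
class FakeBlob:
    def __init__(self, name, data):
        self.name = name
        self.length = len(data)
        self._data = data

    def read(self):
        return self._data

# Mirrors the default trigger body from Step 2.
def main(myblob):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")

blob = FakeBlob("input/sample.pdf", b"%PDF-1.4 dummy bytes")
main(blob)
```

This keeps the feedback loop fast while you iterate on the parsing logic in later steps.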
Step 4: Add Document Intelligence Processing Logic
Enhance your function to call the Azure Document Intelligence Layout API, which analyzes and extracts tables and layout data from PDFs.
Update Dependencies
Add these packages to your requirements.txt:
cryptography
azure-functions
azure-storage-blob
azure-identity
requests
pandas
numpy
Import Necessary Libraries
Modify __init__.py imports:
import logging
from azure.storage.blob import BlobServiceClient
import azure.functions as func
import json
import time
import os
import requests
import pandas as pd
import numpy as np
Implement API Call
Inside the existing main function, insert the following snippet to call the Document Intelligence Layout model:
# Configure your endpoint and key
endpoint = r"Your Document Intelligence Endpoint"
apim_key = "Your Document Intelligence Key"
post_url = endpoint + "/formrecognizer/v2.1/layout/analyze"
# Read blob content
source = myblob.read()
headers = {
'Content-Type': 'application/pdf',
'Ocp-Apim-Subscription-Key': apim_key,
}
# Submit document for analysis
resp = requests.post(url=post_url, data=source, headers=headers)
if resp.status_code != 202:
logging.error(f"POST analyze failed: {resp.text}")
return
get_url = resp.headers["operation-location"]
# Wait for async analysis to complete
wait_sec = 25
logging.info(f"Waiting {wait_sec} seconds for analysis completion...")
time.sleep(wait_sec)
resp = requests.get(url=get_url, headers={"Ocp-Apim-Subscription-Key": apim_key})
resp_json = resp.json()
if resp_json.get("status") != "succeeded":
logging.error(f"GET Layout results failed: {resp.text}")
return
results = resp_json
Security tip: Never hardcode keys in production. Use Azure Key Vault or environment variables.
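In keeping with that tip, the endpoint and key from Step 4 can be read from environment variables instead. A minimal sketch, assuming hypothetical variable names DOCINTEL_ENDPOINT and DOCINTEL_KEY (the placeholder values below exist only so the snippet runs on its own):

```python
import os

def load_docintel_config():
    # Variable names are illustrative, not an Azure convention.
    endpoint = os.environ.get("DOCINTEL_ENDPOINT", "")
    apim_key = os.environ.get("DOCINTEL_KEY", "")
    if not endpoint or not apim_key:
        raise RuntimeError("Set DOCINTEL_ENDPOINT and DOCINTEL_KEY first.")
    return endpoint, apim_key

# Placeholders so the example is self-contained; in a deployed Function
# App these values come from application settings instead.
os.environ.setdefault("DOCINTEL_ENDPOINT", "https://example.cognitiveservices.azure.com")
os.environ.setdefault("DOCINTEL_KEY", "placeholder-key")

endpoint, apim_key = load_docintel_config()
post_url = endpoint + "/formrecognizer/v2.1/layout/analyze"
```

In Azure, set these under the Function App's Configuration blade, ideally as Key Vault references so the secret never lives in app settings directly.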
Step 5: Parse and Extract Table Data
The response JSON contains details about pages, tables, and cells. We’ll convert this into a pandas DataFrame and then export it as a CSV.
pages = results["analyzeResult"]["pageResults"]
# Function to process a single page
def make_page(p):
    res = []
    res_table = []
    page = pages[p]
    y = 0  # running table number within the page
    for tab in page["tables"]:
        for cell in tab["cells"]:
            res.append(cell)
            res_table.append(y)
        y += 1
    res_table_df = pd.DataFrame(res_table)
    res_df = pd.DataFrame(res)
    res_df["table_num"] = res_table_df[0]
    h = res_df.drop(columns=["boundingBox", "elements"])
    h["rownum"] = range(len(h))
    num_table = h["table_num"].max()
    return h, num_table, p
# Process first page
h, num_table, p = make_page(0)
# Loop through each table and reconstruct DataFrame
for k in range(num_table + 1):
    new_table = h[h.table_num == k].copy()
    new_table["rownum"] = range(len(new_table))
    row_table = pages[p]["tables"][k]["rows"]
    col_table = pages[p]["tables"][k]["columns"]
    b = pd.DataFrame(np.zeros((row_table, col_table)))
    s = 0
    for i, j in zip(new_table["rowIndex"], new_table["columnIndex"]):
        b.iat[i, j] = new_table.iloc[s]["text"]
        s += 1
    # b now holds the table data for table k
This generalized code extracts tables but may need tailoring based on your document’s structure.
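To make the cell-to-grid step concrete, here is a self-contained sketch of the same reconstruction using plain Python lists instead of pandas. The response fragment is hand-made for illustration (fields trimmed to what the rebuild needs), not real API output:

```python
# A fabricated fragment shaped like one entry of the v2.1 Layout
# "pageResults" payload, reduced to rows/columns/cells.
page = {
    "tables": [{
        "rows": 2,
        "columns": 2,
        "cells": [
            {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
            {"rowIndex": 0, "columnIndex": 1, "text": "Qty"},
            {"rowIndex": 1, "columnIndex": 0, "text": "Widget"},
            {"rowIndex": 1, "columnIndex": 1, "text": "3"},
        ],
    }]
}

def table_to_grid(tab):
    # Pre-size the grid, then drop each cell's text at its row/column slot.
    grid = [["" for _ in range(tab["columns"])] for _ in range(tab["rows"])]
    for cell in tab["cells"]:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    return grid

grid = table_to_grid(page["tables"][0])
print(grid)  # [['Item', 'Qty'], ['Widget', '3']]
```

Cells spanning multiple rows or columns carry rowSpan/columnSpan fields in the real payload, which this simple version ignores.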
Step 6: Upload CSV Results Back to Blob Storage
Finally, connect to your Azure Storage output container and upload the extracted CSV.
# Connect to blob storage
blob_service_client = BlobServiceClient.from_connection_string(
"DefaultEndpointsProtocol=https;AccountName=YourStorageAccountName;"
"AccountKey=YourStorageAccountKey;EndpointSuffix=core.windows.net"
)
container_client = blob_service_client.get_container_client("output")
# Convert DataFrame to CSV (to_csv returns a string when no path is given)
csv_data = b.to_csv(header=False, index=False)
# Define filename based on input PDF name
csv_name = os.path.splitext(os.path.basename(myblob.name))[0] + ".csv"
# Upload CSV blob
container_client.upload_blob(name=csv_name, data=csv_data, overwrite=True)
logging.info(f"Uploaded {csv_name} to output container.")
Note: Use overwrite=True to allow multiple uploads of the same file name during testing.
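The filename and CSV-serialization steps can be checked in isolation before touching storage. This sketch mirrors them with the standard library only (the csv module standing in for pandas), on a small hypothetical grid:

```python
import csv
import io
import os

def grid_to_csv(grid):
    # Serialize a list-of-lists table to a CSV string with no header and
    # no index, matching DataFrame.to_csv(header=False, index=False).
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()

def csv_name_for(blob_name):
    # "input/sample.pdf" -> "sample.csv"
    return os.path.splitext(os.path.basename(blob_name))[0] + ".csv"

csv_data = grid_to_csv([["Item", "Qty"], ["Widget", "3"]])
print(csv_name_for("input/sample.pdf"))  # sample.csv
```

Keeping these as small pure functions also makes the Function body easier to unit-test.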
Step 7: Run the Complete Function and Validate
- Restart your function with F5.
- Upload a PDF to the input container.
- Monitor logs for processing status.
- Check the output container for the resulting CSV file.
You can now connect this output container to Power BI or other analytics tools to create rich visualizations from the extracted table data.
Best Practices and Tips
- Secure Credentials: Avoid embedding keys in source code. Use Azure Key Vault or environment variables.
- Handle Large Files: For bigger documents, increase the wait time or implement polling logic with retries.
- Error Handling: Add comprehensive exception handling and logging to troubleshoot issues.
- Customize Parsing: The sample parsing logic assumes a simple table structure; adapt it for complex layouts or multi-page documents.
- Scaling: Azure Functions scale automatically; consider function timeout limits and optimize processing accordingly.
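On the polling point: the fixed 25-second sleep in Step 4 is fragile for large documents. Below is a sketch of a retry loop with an injectable fetch callable so it can be tested without the real API; the function name, attempt count, and delay are illustrative choices, not from the original tutorial.

```python
import time

def poll_for_result(fetch, max_attempts=10, delay_sec=2):
    # fetch() should return the parsed JSON body of a GET against the
    # operation-location URL; loop until the analysis finishes.
    for _ in range(max_attempts):
        body = fetch()
        status = body.get("status")
        if status == "succeeded":
            return body
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {body}")
        time.sleep(delay_sec)
    raise TimeoutError("Analysis did not complete in time")

# Stubbed fetch: pretends the service reports 'running' twice, then done.
responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"pageResults": []}},
])
result = poll_for_result(lambda: next(responses), delay_sec=0)
print(result["status"])  # succeeded
```

In the function itself, fetch would wrap the requests.get call against get_url with the subscription-key header.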
Conclusion
This comprehensive tutorial demonstrated how to leverage the Azure AI Document Intelligence service in conjunction with Azure Functions to create an intelligent, automated document processing pipeline. By extracting tabular data from PDFs uploaded to Azure Blob Storage and converting it into CSV files, you enable seamless integration with analytics platforms like Power BI.
This architecture reduces manual data entry, improves accuracy, and accelerates business insights from unstructured documents.
Further Reading
- Azure Document Intelligence Overview
- Azure Functions Blob Trigger
- Azure Storage Blob SDK for Python
- Power BI Integration with Azure Blob Storage
Unlock the full potential of your documents today by building your own intelligent document processing pipeline with Azure AI Document Intelligence and Azure Functions!