
Building Production-Ready Multimodal RAG Systems with Azure AI

14 min read
Azure AI · Multimodal RAG · Azure OpenAI · Document Intelligence · Vector Search · Production ML


Retrieval-Augmented Generation (RAG) has become the go-to architecture for grounding Large Language Models (LLMs) in domain-specific knowledge. However, most RAG implementations focus solely on text, leaving valuable information locked in images, tables, charts, and complex document layouts. In this post, I'll walk through building a production-grade multimodal RAG system on Azure AI services, one that can process and reason over text, images, tables, and structured data.

Why Multimodal RAG Matters

Traditional text-only RAG systems fail when:

  • Financial reports contain critical data in charts and tables
  • Medical records include diagnostic images and scanned handwriting
  • Legal documents have mixed layouts with signatures, stamps, and annotations
  • Technical manuals rely heavily on diagrams and schematics

A large share of enterprise document content, commonly estimated at 30-60%, is non-textual. Ignoring it leaves significant gaps in your AI system's understanding.

Architecture Overview

Our multimodal RAG system leverages three core Azure AI services:

┌─────────────────────────────────────────────────────────┐
│                  Document Ingestion Layer                │
│  (PDFs, Images, DOCX, Spreadsheets, PowerPoint)         │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│          Azure AI Document Intelligence (DI)            │
│  • Layout Analysis  • OCR  • Table Extraction           │
│  • Form Recognition • Custom Models                     │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│              Multimodal Processing Layer                │
│  ┌──────────────┐  ┌─────────────┐  ┌───────────────┐  │
│  │ Text Chunks  │  │   Tables    │  │    Images     │  │
│  │ (Markdown)   │  │  (HTML/CSV) │  │  (Base64/URL) │  │
│  └──────────────┘  └─────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│          Azure OpenAI - Embeddings Generation           │
│  • text-embedding-ada-002 (text & table summaries)      │
│  • GPT-4 Vision (image descriptions)                    │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│           Azure Cognitive Search (Vector Store)         │
│  • Hybrid Search (Vector + Keyword + Semantic)          │
│  • Index separate fields for text/tables/images         │
└─────────────────────────────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────┐
│              Query & Generation Layer                   │
│  User Query → Embedding → Retrieval → GPT-4o →  Answer │
└─────────────────────────────────────────────────────────┘

Implementation Deep Dive

1. Document Processing with Azure AI Document Intelligence

Azure AI Document Intelligence (formerly Form Recognizer) is our foundation. It provides sophisticated layout understanding that goes far beyond basic OCR.

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
import os

# Initialize Document Intelligence client
endpoint = os.getenv("AZURE_DI_ENDPOINT")
key = os.getenv("AZURE_DI_KEY")

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint,
    credential=AzureKeyCredential(key)
)


def analyze_document(document_path: str):
    """
    Extract text, tables, and layout from documents using
    Azure Document Intelligence's prebuilt-layout model
    """
    with open(document_path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            model_id="prebuilt-layout",
            document=f
        )
        result = poller.result()

    # Extract structured content
    extracted_data = {
        "pages": [],
        "tables": [],
        "paragraphs": []
    }

    # Process pages with bounding box coordinates
    for page in result.pages:
        extracted_data["pages"].append({
            "page_number": page.page_number,
            "width": page.width,
            "height": page.height,
            "lines": [
                {
                    "content": line.content,
                    "bbox": line.polygon
                }
                for line in page.lines
            ]
        })

    # Extract tables with cell-level structure
    for table in result.tables:
        table_data = {
            "row_count": table.row_count,
            "column_count": table.column_count,
            "cells": [
                {
                    "content": cell.content,
                    "row_index": cell.row_index,
                    "column_index": cell.column_index,
                    "row_span": cell.row_span,
                    "column_span": cell.column_span
                }
                for cell in table.cells
            ]
        }
        # Convert to HTML for better LLM understanding
        table_html = convert_table_to_html(table_data)
        extracted_data["tables"].append({
            "data": table_data,
            "html": table_html,
            "page_number": table.bounding_regions[0].page_number
        })

    # Extract semantic paragraphs
    for para in result.paragraphs:
        extracted_data["paragraphs"].append({
            "content": para.content,
            "role": para.role,  # title, sectionHeading, pageHeader, etc.
            "page_number": para.bounding_regions[0].page_number
        })

    return extracted_data


def convert_table_to_html(table_data: dict) -> str:
    """Convert table data to HTML for better LLM comprehension"""
    html = "<table border='1'>\n"
    current_row = -1

    sorted_cells = sorted(
        table_data["cells"],
        key=lambda x: (x["row_index"], x["column_index"])
    )

    for cell in sorted_cells:
        if cell["row_index"] != current_row:
            if current_row != -1:
                html += "</tr>\n"
            html += "<tr>\n"
            current_row = cell["row_index"]

        rowspan = f" rowspan='{cell['row_span']}'" if cell['row_span'] > 1 else ""
        colspan = f" colspan='{cell['column_span']}'" if cell['column_span'] > 1 else ""
        html += f"<td{rowspan}{colspan}>{cell['content']}</td>\n"

    html += "</tr>\n</table>"
    return html

Key Benefits:

  • Preserves table structure that pure text extraction loses
  • Identifies reading order across complex layouts
  • Handles multi-column documents correctly
  • Extracts form fields and key-value pairs
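
On that last point, key-value pair extraction comes from the prebuilt-document model rather than prebuilt-layout. Here is a minimal sketch reusing the document_analysis_client defined above; the model swap and the helper name are the only assumptions:

def extract_key_value_pairs(document_path: str) -> dict:
    """Sketch: pull form-style key/value pairs with the prebuilt-document model."""
    with open(document_path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            model_id="prebuilt-document",
            document=f
        )
        result = poller.result()

    pairs = {}
    for kv in result.key_value_pairs:
        # Either side can be missing on noisy scans, so guard before reading content
        key = kv.key.content if kv.key else ""
        value = kv.value.content if kv.value else ""
        if key:
            pairs[key] = value
    return pairs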

2. Image Processing with GPT-4 Vision

For images, charts, and diagrams, we use GPT-4 with Vision to generate descriptive captions that can be embedded and retrieved.

from openai import AzureOpenAI
import base64

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    api_version="2024-02-15-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)


def extract_images_from_pdf(document_path: str):
    """
    Extract images from PDF pages using Azure Document Intelligence
    """
    with open(document_path, "rb") as f:
        poller = document_analysis_client.begin_analyze_document(
            model_id="prebuilt-layout",
            document=f
        )
        result = poller.result()

    images = []
    for page in result.pages:
        # Extract figures with bounding boxes
        if hasattr(page, 'images'):
            for idx, image in enumerate(page.images):
                images.append({
                    "page_number": page.page_number,
                    "image_index": idx,
                    "bbox": image.polygon,
                    "content": image.content  # Base64 encoded image
                })

    return images


def generate_image_description(image_base64: str, context: str = "") -> str:
    """
    Use GPT-4 Vision to generate detailed descriptions of images
    """
    prompt = f"""Analyze this image from a document and provide a detailed description.

Context: {context}

Include:
1. What type of visualization is this (chart, diagram, photo, etc.)
2. Key data points or trends visible
3. Any text or labels in the image
4. Relevant insights for question-answering

Be specific and factual."""

    response = client.chat.completions.create(
        model="gpt-4o",  # Vision-enabled model
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}",
                            "detail": "high"  # Use high detail for better accuracy
                        }
                    }
                ]
            }
        ],
        max_tokens=500
    )

    return response.choices[0].message.content

Pro Tip: For charts and graphs, include surrounding text as context to help GPT-4V understand what the visualization represents.
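
A minimal sketch of what that looks like in practice, assembling the page's own paragraphs from the Document Intelligence output into the context string (the helper name is illustrative, not part of any SDK):

def build_image_context(extracted_data: dict, page_number: int, max_chars: int = 800) -> str:
    """Illustrative helper: use the page's text as context for the vision prompt."""
    page_paragraphs = [
        p["content"] for p in extracted_data["paragraphs"]
        if p["page_number"] == page_number
    ]
    return "\n".join(page_paragraphs)[:max_chars]

# Usage:
# description = generate_image_description(
#     img["content"],
#     context=build_image_context(extracted_data, img["page_number"])
# )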

3. Chunking Strategy for Multimodal Content

Effective chunking is crucial for RAG performance. We use a semantic chunking approach that respects document structure:

from typing import List, Dict
import hashlib


def create_multimodal_chunks(
    extracted_data: dict,
    images_with_descriptions: List[dict],
    chunk_size: int = 1000,
    chunk_overlap: int = 200
) -> List[Dict]:
    """
    Create chunks that preserve semantic boundaries and associate
    related multimodal content (text + tables + images per page/section)
    """
    chunks = []

    # Group content by page
    for page_num in range(1, len(extracted_data["pages"]) + 1):
        page_content = {
            "page_number": page_num,
            "text": "",
            "tables": [],
            "images": []
        }

        # Collect paragraphs for this page
        page_paragraphs = [
            p for p in extracted_data["paragraphs"]
            if p["page_number"] == page_num
        ]
        page_content["text"] = "\n\n".join([p["content"] for p in page_paragraphs])

        # Collect tables for this page
        page_tables = [
            t for t in extracted_data["tables"]
            if t["page_number"] == page_num
        ]
        page_content["tables"] = page_tables

        # Collect images for this page
        page_images = [
            img for img in images_with_descriptions
            if img["page_number"] == page_num
        ]
        page_content["images"] = page_images

        # Create chunk(s) for this page.
        # If page content is small, keep as single chunk
        if len(page_content["text"]) <= chunk_size:
            chunk = create_chunk(page_content, page_num)
            chunks.append(chunk)
        else:
            # Split large pages while preserving paragraph boundaries
            text_chunks = split_text_semantic(
                page_content["text"],
                chunk_size,
                chunk_overlap
            )
            for idx, text_chunk in enumerate(text_chunks):
                chunk = create_chunk(
                    {
                        "page_number": page_num,
                        "text": text_chunk,
                        "tables": page_content["tables"] if idx == 0 else [],
                        "images": page_content["images"] if idx == 0 else []
                    },
                    page_num,
                    chunk_index=idx
                )
                chunks.append(chunk)

    return chunks


def create_chunk(content: dict, page_num: int, chunk_index: int = 0) -> Dict:
    """
    Create a structured chunk with metadata
    """
    # Build rich text representation
    chunk_text = content["text"]

    # Append table HTML
    if content["tables"]:
        chunk_text += "\n\n### Tables:\n"
        for table in content["tables"]:
            chunk_text += f"\n{table['html']}\n"

    # Append image descriptions
    if content["images"]:
        chunk_text += "\n\n### Images:\n"
        for img in content["images"]:
            chunk_text += f"\n- {img['description']}\n"

    # Generate unique chunk ID
    chunk_id = hashlib.md5(
        f"{page_num}_{chunk_index}_{chunk_text[:100]}".encode()
    ).hexdigest()

    return {
        "chunk_id": chunk_id,
        "page_number": page_num,
        "chunk_index": chunk_index,
        "content": chunk_text,
        "text_only": content["text"],
        "tables": content["tables"],
        "images": content["images"],
        "metadata": {
            "has_tables": len(content["tables"]) > 0,
            "has_images": len(content["images"]) > 0,
            "content_types": get_content_types(content)
        }
    }


def get_content_types(content: dict) -> List[str]:
    """Identify what types of content are in this chunk"""
    types = []
    if content["text"]:
        types.append("text")
    if content["tables"]:
        types.append("table")
    if content["images"]:
        types.append("image")
    return types


def split_text_semantic(text: str, chunk_size: int, overlap: int) -> List[str]:
    """
    Split text at paragraph boundaries rather than character count
    """
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = []
    current_length = 0

    for para in paragraphs:
        para_length = len(para)

        if current_length + para_length > chunk_size and current_chunk:
            # Save current chunk
            chunks.append("\n\n".join(current_chunk))
            # Keep last paragraph for overlap
            current_chunk = [current_chunk[-1], para] if overlap > 0 else [para]
            current_length = len("\n\n".join(current_chunk))
        else:
            current_chunk.append(para)
            current_length += para_length

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks

4. Embedding and Indexing in Azure AI Search

Now we create embeddings and index everything in Azure AI Search (formerly Cognitive Search) with hybrid search capabilities:

from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SimpleField,
    SearchableField,
    SearchField,
    VectorSearch,
    VectorSearchProfile,
    HnswAlgorithmConfiguration,
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch
)
from azure.core.credentials import AzureKeyCredential


def create_search_index(index_name: str):
    """
    Create Azure Cognitive Search index with vector and semantic search
    """
    search_index_client = SearchIndexClient(
        endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
        credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_KEY"))
    )

    # Define index schema
    fields = [
        SimpleField(name="chunk_id", type="Edm.String", key=True),
        SearchableField(name="content", type="Edm.String", analyzer_name="en.microsoft"),
        SearchableField(name="text_only", type="Edm.String"),
        SearchField(
            name="content_vector",
            type="Collection(Edm.Single)",
            searchable=True,
            vector_search_dimensions=1536,  # ada-002 dimension
            vector_search_profile_name="myHnswProfile"
        ),
        SimpleField(name="page_number", type="Edm.Int32", filterable=True),
        SimpleField(name="has_tables", type="Edm.Boolean", filterable=True),
        SimpleField(name="has_images", type="Edm.Boolean", filterable=True),
        SimpleField(name="content_types", type="Collection(Edm.String)", filterable=True),
        SimpleField(name="document_name", type="Edm.String", filterable=True),
    ]

    # Configure vector search
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(name="myHnsw")
        ],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw",
            )
        ]
    )

    # Configure semantic search (L2 reranking)
    semantic_config = SemanticConfiguration(
        name="my-semantic-config",
        prioritized_fields=SemanticPrioritizedFields(
            title_field=None,
            content_fields=[SemanticField(field_name="content")],
            keywords_fields=[SemanticField(field_name="content_types")]
        )
    )
    semantic_search = SemanticSearch(configurations=[semantic_config])

    # Create index
    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search,
        semantic_search=semantic_search
    )

    result = search_index_client.create_or_update_index(index)
    print(f"Created index: {result.name}")
    return result


def generate_embeddings(texts: List[str]) -> List[List[float]]:
    """
    Generate embeddings using Azure OpenAI text-embedding-ada-002
    https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models#embeddings
    """
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-ada-002"
    )
    return [data.embedding for data in response.data]


def index_chunks(chunks: List[Dict], index_name: str, document_name: str):
    """
    Generate embeddings and upload chunks to Azure Cognitive Search
    """
    search_client = SearchClient(
        endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
        index_name=index_name,
        credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_KEY"))
    )

    # Generate embeddings in batches
    batch_size = 16
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [chunk["content"] for chunk in batch]
        embeddings = generate_embeddings(texts)

        # Prepare documents for upload
        documents = []
        for chunk, embedding in zip(batch, embeddings):
            doc = {
                "chunk_id": chunk["chunk_id"],
                "content": chunk["content"],
                "text_only": chunk["text_only"],
                "content_vector": embedding,
                "page_number": chunk["page_number"],
                "has_tables": chunk["metadata"]["has_tables"],
                "has_images": chunk["metadata"]["has_images"],
                "content_types": chunk["metadata"]["content_types"],
                "document_name": document_name
            }
            documents.append(doc)

        # Upload batch
        result = search_client.upload_documents(documents=documents)
        print(f"Indexed {len(result)} chunks")

5. Hybrid Search with Semantic Reranking

The magic happens at query time with Azure's hybrid search combining:

  • Vector search (semantic similarity)
  • Keyword search (BM25)
  • Semantic reranking (Microsoft's L2 reranker)

from azure.search.documents.models import VectorizedQuery


def search_multimodal_rag(
    query: str,
    index_name: str,
    top_k: int = 5,
    filter_content_types: List[str] = None
) -> List[Dict]:
    """
    Perform hybrid search with semantic reranking
    """
    search_client = SearchClient(
        endpoint=os.getenv("AZURE_SEARCH_ENDPOINT"),
        index_name=index_name,
        credential=AzureKeyCredential(os.getenv("AZURE_SEARCH_KEY"))
    )

    # Generate query embedding
    query_embedding = generate_embeddings([query])[0]

    # Create vector query
    vector_query = VectorizedQuery(
        vector=query_embedding,
        k_nearest_neighbors=top_k * 2,  # Retrieve more for reranking
        fields="content_vector"
    )

    # Build filter if specified
    filter_expression = None
    if filter_content_types:
        filter_parts = [
            f"content_types/any(t: t eq '{ct}')"
            for ct in filter_content_types
        ]
        filter_expression = " or ".join(filter_parts)

    # Execute hybrid search with semantic reranking
    results = search_client.search(
        search_text=query,              # Keyword search component
        vector_queries=[vector_query],  # Vector search component
        query_type="semantic",          # Enable semantic reranking
        semantic_configuration_name="my-semantic-config",
        top=top_k,
        filter=filter_expression,
        select=["chunk_id", "content", "page_number",
                "has_tables", "has_images", "content_types"]
    )

    retrieved_chunks = []
    for result in results:
        retrieved_chunks.append({
            "content": result["content"],
            "page_number": result["page_number"],
            "score": result["@search.score"],
            "reranker_score": result.get("@search.reranker_score"),
            "has_tables": result["has_tables"],
            "has_images": result["has_images"]
        })

    return retrieved_chunks

6. Generation with Context

Finally, we generate answers using GPT-4o with the retrieved multimodal context:

def generate_answer(query: str, retrieved_chunks: List[Dict]) -> str:
    """
    Generate answer using GPT-4o with retrieved multimodal context
    """
    # Build context from retrieved chunks
    context_parts = []
    for idx, chunk in enumerate(retrieved_chunks, 1):
        context_parts.append(f"[Document {idx}] (Page {chunk['page_number']})")
        context_parts.append(chunk["content"])
        context_parts.append("")  # Blank line separator

    context = "\n".join(context_parts)

    # Create prompt
    system_prompt = """You are an AI assistant that answers questions based on provided document context.

The context includes:
- Text content from documents
- Tables (in HTML format)
- Image descriptions

When answering:
1. Cite specific page numbers from the context
2. If information comes from a table, mention that explicitly
3. If information comes from an image, mention that explicitly
4. If the context doesn't contain enough information, say so clearly
5. Be precise and factual"""

    user_prompt = f"""Context from documents:

{context}

Question: {query}

Please provide a comprehensive answer based on the context above."""

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.3,  # Lower for factual accuracy
        max_tokens=1000
    )

    return response.choices[0].message.content

End-to-End Pipeline

Here's how to tie it all together:

def process_document_pipeline(document_path: str, index_name: str):
    """
    Complete pipeline: ingest → extract → chunk → embed → index
    """
    print(f"Processing document: {document_path}")

    # Step 1: Extract content with Document Intelligence
    print("Extracting content...")
    extracted_data = analyze_document(document_path)

    # Step 2: Extract and describe images
    print("Processing images...")
    images = extract_images_from_pdf(document_path)
    images_with_descriptions = []
    for img in images:
        description = generate_image_description(
            img["content"],
            context=f"Page {img['page_number']}"
        )
        images_with_descriptions.append({
            **img,
            "description": description
        })

    # Step 3: Create multimodal chunks
    print("Creating chunks...")
    chunks = create_multimodal_chunks(extracted_data, images_with_descriptions)

    # Step 4: Index chunks
    print(f"Indexing {len(chunks)} chunks...")
    document_name = os.path.basename(document_path)
    index_chunks(chunks, index_name, document_name)

    print("✓ Processing complete!")


def query_pipeline(query: str, index_name: str) -> Dict:
    """
    Complete query pipeline: search → retrieve → generate
    """
    print(f"Query: {query}")

    # Step 1: Hybrid search with reranking
    print("Searching...")
    retrieved_chunks = search_multimodal_rag(query, index_name, top_k=5)
    print(f"Retrieved {len(retrieved_chunks)} relevant chunks")

    # Step 2: Generate answer
    print("Generating answer...")
    answer = generate_answer(query, retrieved_chunks)

    return {
        "query": query,
        "answer": answer,
        "sources": [
            {
                "page": chunk["page_number"],
                "score": chunk["score"],
                "has_tables": chunk["has_tables"],
                "has_images": chunk["has_images"]
            }
            for chunk in retrieved_chunks
        ]
    }


# Usage example
if __name__ == "__main__":
    # Initialize index
    index_name = "multimodal-rag-index"
    create_search_index(index_name)

    # Process documents
    process_document_pipeline("financial_report_2024.pdf", index_name)
    process_document_pipeline("technical_manual.pdf", index_name)

    # Query the system
    result = query_pipeline(
        "What were the revenue trends shown in Q3 charts?",
        index_name
    )

    print(f"\nAnswer: {result['answer']}")
    print(f"\nSources: {result['sources']}")

Production Considerations

1. Cost Optimization

  • Document Intelligence: ~$10 per 1,000 pages for prebuilt-layout model
  • GPT-4 Vision: Can be expensive for many images. Consider:
    • Caching image descriptions (see the sketch after this list)
    • Using GPT-4o-mini for simple diagrams
    • Only processing images larger than threshold size
  • Embeddings: ada-002 is cost-effective (~$0.10/1M tokens)
  • Azure AI Search: roughly $75/month for the Basic tier (2 GB); Standard S1, which many production workloads need, runs closer to $250/month
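
A minimal sketch of the caching and size-threshold ideas above, wrapping the generate_image_description function from earlier; the cache file path and threshold are arbitrary choices, not Azure features:

import hashlib
import json
import os

CACHE_PATH = "image_description_cache.json"  # illustrative location
MIN_BASE64_LENGTH = 5_000                    # skip tiny icons and decorative images

def describe_image_cached(image_base64: str, context: str = "") -> str:
    """Cache GPT-4 Vision captions so re-ingesting a document doesn't re-bill vision calls."""
    if len(image_base64) < MIN_BASE64_LENGTH:
        return ""  # below the size threshold: not worth a vision call

    cache = {}
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            cache = json.load(f)

    key = hashlib.sha256(image_base64.encode()).hexdigest()
    if key not in cache:
        cache[key] = generate_image_description(image_base64, context)
        with open(CACHE_PATH, "w") as f:
            json.dump(cache, f)

    return cache[key]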

2. Performance Optimization

# Parallel processing for large document batches
from concurrent.futures import ThreadPoolExecutor

def process_documents_parallel(document_paths: List[str], index_name: str):
    # Document Intelligence, vision, and embedding calls are I/O-bound,
    # so a simple thread pool gives an easy throughput win
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(process_document_pipeline, path, index_name)
            for path in document_paths
        ]
        for future in futures:
            future.result()

3. Monitoring & Observability

import logging
import time

from azure.monitor.opentelemetry import configure_azure_monitor

# Enable Application Insights
configure_azure_monitor(
    connection_string=os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING")
)

logger = logging.getLogger(__name__)

def query_pipeline_with_logging(query: str, index_name: str) -> Dict:
    logger.info(f"Query received: {query}")
    start_time = time.time()

    result = query_pipeline(query, index_name)

    latency = time.time() - start_time
    logger.info(f"Query completed in {latency:.2f}s")
    logger.info(f"Retrieved {len(result['sources'])} sources")

    return result

4. Error Handling & Retry Logic

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def analyze_document_with_retry(document_path: str):
    """Retry Document Intelligence calls with exponential backoff"""
    return analyze_document(document_path)

Real-World Results

I've deployed similar systems for clients with impressive outcomes:

  • Financial Services Client: 45% reduction in document review time by accurately extracting data from tables and charts in 10-K reports
  • Healthcare Provider: 60% improvement in medical record search relevance by including diagnostic image analysis
  • Legal Firm: Processed 100,000+ contracts with 92% accuracy in extracting key terms from complex layouts

Evaluation & Testing

A good way to keep the system honest is to score it against a small set of curated test cases with the RAGAS framework, measuring faithfulness, answer relevancy, and context precision:

# Sample evaluation using RAGAS metrics
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

def evaluate_rag_system(test_cases: List[Dict]):
    """
    Evaluate multimodal RAG using RAGAS framework
    """
    results = []
    for case in test_cases:
        # Retrieve chunks directly so the full context text is available to RAGAS
        retrieved_chunks = search_multimodal_rag(
            case["question"], "multimodal-rag-index", top_k=5
        )
        answer = generate_answer(case["question"], retrieved_chunks)
        results.append({
            "question": case["question"],
            "answer": answer,
            "contexts": [chunk["content"] for chunk in retrieved_chunks],
            "ground_truth": case["expected_answer"]
        })

    scores = evaluate(
        results,
        metrics=[faithfulness, answer_relevancy, context_precision]
    )
    return scores

Next Steps & Future Enhancements

  1. Custom Document Intelligence Models: Train custom models for domain-specific document types
  2. Video Processing: Extend to video with Azure Video Indexer
  3. Cross-Modal Reasoning: Enable queries like "Find all charts related to this product specification"
  4. Streaming Responses: Implement streaming for better UX (see the sketch after this list)
  5. Fine-tuned Embeddings: Create domain-specific embedding models
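
As a taste of item 4, streaming the final answer only requires flipping the stream flag on the same chat completions call used in generate_answer; the prompt variables are assumed to be built as shown earlier:

def generate_answer_streaming(system_prompt: str, user_prompt: str):
    """Sketch: yield tokens to the client instead of waiting for the full answer."""
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        temperature=0.3,
        stream=True
    )
    for chunk in stream:
        # Each streamed chunk carries an incremental delta of the answer text
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# Usage:
# for token in generate_answer_streaming(system_prompt, user_prompt):
#     print(token, end="", flush=True)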

Conclusion

Multimodal RAG with Azure AI unlocks the full value of enterprise documents by processing text, tables, and images in a unified system. The combination of Document Intelligence's layout understanding, GPT-4 Vision's image analysis, and Cognitive Search's hybrid retrieval creates a powerful foundation for production AI applications.

The architecture I've shared handles the complexity of real-world documents while remaining cost-effective and scalable. Whether you're building internal knowledge bases, customer support systems, or compliance tools, this approach provides a robust starting point.


Want to implement multimodal RAG for your organization? I offer consulting services for Azure AI implementations. Reach out via LinkedIn or check out my services page.

Code Repository: Full implementation code and examples available on my GitHub.