Advanced

Implement Knowledge Mining (15-20%)

Learn how to use Azure AI Search (formerly Azure Cognitive Search) to extract insights from large volumes of unstructured data using indexers, AI skillsets, custom skills, and knowledge stores.

Azure AI Search Architecture

Azure AI Search is a cloud search service that provides full-text search, AI enrichment, and vector search capabilities. The core components are:

  • Data sources — Azure Blob Storage, Azure SQL, Cosmos DB, Azure Table Storage, or any data accessible via connectors
  • Indexers — Automated crawlers that read data from sources and populate the search index
  • Skillsets — AI enrichment pipelines that process data during indexing (OCR, NER, translation, etc.)
  • Index — The searchable data store with fields, types, and attributes
  • Knowledge store — Optional secondary output that saves enriched data to Azure Storage for downstream analytics
💡
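The Enrichment Pipeline: Data Source → Indexer → Skillset (AI enrichment) → Index + Knowledge Store. Understanding this flow is critical for the exam. The indexer ties the pieces together: it references the data source, the skillset, and the target index by name, and the knowledge store is defined inside the skillset.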
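
To make the index component concrete, here is a minimal sketch of an index definition, built as a Python dictionary that could be serialized and sent as a REST request body. The field names (id, content, documentName, detectedOrganizations) are illustrative assumptions chosen to line up with the indexer example later in this guide. They are not a required schema.

```python
import json


def build_index(name: str) -> dict:
    """Return a minimal index definition. Field names are illustrative."""
    return {
        "name": name,
        "fields": [
            # Every index needs exactly one key field of type Edm.String.
            {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
            # Full-text searchable document body.
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Populated from source metadata through a field mapping.
            {"name": "documentName", "type": "Edm.String",
             "searchable": True, "sortable": True},
            # Populated from skillset output through an output field mapping.
            {"name": "detectedOrganizations", "type": "Collection(Edm.String)",
             "searchable": True, "filterable": True, "facetable": True},
        ],
    }


index = build_index("my-index")
key_fields = [f["name"] for f in index["fields"] if f.get("key")]
print(json.dumps(index, indent=2))
```

The attributes (searchable, filterable, sortable, facetable) decide how each field can be used in queries, so they are worth choosing per field rather than enabling everywhere.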

Indexers and Data Sources

Indexers automate the data ingestion process:

  • Scheduling — Run once, on a schedule (e.g., every 5 minutes), or on-demand via REST API
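  • Change detection — Indexers use a change detection policy (high-water mark or SQL integrated change tracking) to process only new or modified documents; a separate deletion detection policy (soft delete) tells the indexer which documents to remove from the index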
  • Blob indexing — Supports PDF, DOCX, PPTX, XLSX, HTML, JSON, CSV, TXT, images (with OCR skill), and more
  • Field mappings — Map source fields to index fields when names differ or transformations are needed
  • Output field mappings — Map skillset output to index fields
// Indexer definition (REST API). The interval is an ISO 8601 duration: PT2H runs every 2 hours.
{
  "name": "my-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-index",
  "skillsetName": "my-skillset",
  "schedule": {
    "interval": "PT2H"
  },
  "fieldMappings": [
    { "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "documentName" }
  ],
  "outputFieldMappings": [
    { "sourceFieldName": "/document/organizations",
      "targetFieldName": "detectedOrganizations" }
  ]
}
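
Change and deletion detection are configured on the data source, not on the indexer. The sketch below builds a data source definition for an Azure SQL table with a high-water mark policy and a soft delete policy. The function name, the table and column names, and the connection string placeholder are all hypothetical. The two @odata.type values are the policy type names used by the REST API.

```python
import json


def build_sql_datasource(name: str, table: str,
                         watermark_column: str,
                         soft_delete_column: str) -> dict:
    """Return a data source definition with change and deletion detection."""
    return {
        "name": name,
        "type": "azuresql",
        # Placeholder only; never hard-code real credentials.
        "credentials": {"connectionString": "<connection-string>"},
        "container": {"name": table},
        # Only rows whose watermark column increased are re-read.
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": watermark_column,
        },
        # Rows flagged with the marker value are removed from the index.
        "dataDeletionDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
            "softDeleteColumnName": soft_delete_column,
            "softDeleteMarkerValue": "true",
        },
    }


datasource = build_sql_datasource(
    "my-sql-datasource", "Products", "LastUpdated", "IsDeleted")
print(json.dumps(datasource, indent=2))
```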

Built-in Skills

Azure AI Search includes built-in cognitive skills for AI enrichment:

Category | Skills | Description
Vision | OCR, Image Analysis, Image tags | Extract text from images, generate captions and tags
Language | Entity Recognition, Key Phrases, Sentiment, Language Detection, PII Detection | Extract entities, key phrases, sentiment from text
Translation | Text Translation | Translate text to target language during indexing
Utility | Shaper, Merge, Text Split, Conditional | Reshape data, merge fields, split large documents, conditional logic
Document | Document Extraction | Extract content and metadata from documents within a data source
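
As a sketch of how built-in skills are wired together, the following builds a skillset with an Entity Recognition skill (limited to organizations) and a Key Phrase Extraction skill. The skillset name and description are assumptions. The helper output_paths is hypothetical and only illustrates how a skill's context and targetName combine into the enrichment path that an output field mapping refers to.

```python
import json


def build_skillset(name: str) -> dict:
    """Return a skillset with two built-in language skills."""
    return {
        "name": name,
        "description": "Extract organizations and key phrases",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
                "context": "/document",
                "categories": ["Organization"],
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "organizations",
                             "targetName": "organizations"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
                "context": "/document",
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "keyPhrases",
                             "targetName": "keyPhrases"}],
            },
        ],
    }


def output_paths(skillset: dict) -> list:
    """List the enrichment tree paths that the skills write to."""
    return [
        f'{skill["context"]}/{out["targetName"]}'
        for skill in skillset["skills"]
        for out in skill["outputs"]
    ]


skillset = build_skillset("my-skillset")
paths = output_paths(skillset)
print(json.dumps(paths))
```

The first path, /document/organizations, is the kind of value an indexer's outputFieldMappings entry would use as its sourceFieldName.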

Custom Skills

When built-in skills are not enough, you can create custom skills:

Custom Web API Skill

The most common custom skill type. It calls an external REST endpoint (typically an Azure Function) during indexing:

  • Receives a batch of records from the enrichment pipeline
  • Processes them (call an external ML model, apply business logic, query a database)
  • Returns enriched records in the expected JSON format
// Custom Web API Skill definition
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "name": "custom-entity-lookup",
  "description": "Lookup entities in our product database",
  "uri": "https://my-function-app.azurewebsites.net/api/EntityLookup",
  "httpMethod": "POST",
  "timeout": "PT30S",
  "batchSize": 10,
  "context": "/document",
  "inputs": [
    { "name": "text", "source": "/document/content" }
  ],
  "outputs": [
    { "name": "productCodes", "targetName": "detectedProducts" }
  ]
}
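
On the receiving side, the endpoint gets a JSON body with a values array and must return one result per recordId. The handler below is a minimal, framework-free sketch of that contract. The PRODUCT_CODES lookup table and the substring-matching logic are hypothetical stand-ins for a real database query or model call. In an Azure Function this logic would sit inside the HTTP trigger.

```python
import json

# Hypothetical lookup table standing in for a product database.
PRODUCT_CODES = {"widget": "P-100", "gadget": "P-200"}


def handle_skill_request(body: dict) -> dict:
    """Process a custom skill batch and return one result per input record."""
    results = []
    for record in body.get("values", []):
        text = record.get("data", {}).get("text")
        if not isinstance(text, str):
            # Report a per-record error instead of failing the whole batch.
            results.append({
                "recordId": record["recordId"],
                "data": {},
                "errors": [{"message": "Missing 'text' input"}],
                "warnings": [],
            })
            continue
        lowered = text.lower()
        codes = [code for word, code in PRODUCT_CODES.items() if word in lowered]
        results.append({
            "recordId": record["recordId"],  # must echo the incoming id
            "data": {"productCodes": codes},  # matches the skill's output name
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


request = {"values": [
    {"recordId": "1", "data": {"text": "The Widget ships with a gadget."}},
    {"recordId": "2", "data": {}},
]}
response = handle_skill_request(request)
print(json.dumps(response, indent=2))
```

The key under data (productCodes) has to match the name in the skill definition's outputs, and each recordId must be returned unchanged so the pipeline can pair results with documents.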

Azure Machine Learning Skill

Call a deployed Azure ML model directly from the enrichment pipeline. Useful for running custom ML models (classification, NER, etc.) on indexed content.
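
A skill definition for this case might look roughly like the sketch below, which assumes key-based authentication against an online endpoint. The skill name, the input and output names, and the placeholder URI and key are hypothetical. The input and output names in particular must match whatever the deployed model's scoring script expects, so treat them as assumptions to verify against your own deployment.

```python
import json


def build_aml_skill(scoring_uri: str, api_key: str) -> dict:
    """Return an Azure Machine Learning skill definition (key authentication)."""
    return {
        "@odata.type": "#Microsoft.Skills.Custom.AmlSkill",
        "name": "product-classifier",  # hypothetical skill name
        "description": "Classify documents with a deployed Azure ML model",
        "context": "/document",
        "uri": scoring_uri,   # scoring URI of the deployed endpoint
        "key": api_key,       # endpoint key; placeholder in this sketch
        "timeout": "PT30S",
        # Names below are assumptions; align them with the scoring script.
        "inputs": [{"name": "text", "source": "/document/content"}],
        "outputs": [{"name": "category", "targetName": "productCategory"}],
    }


aml_skill = build_aml_skill("<scoring-uri>", "<endpoint-key>")
print(json.dumps(aml_skill, indent=2))
```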

Knowledge Store

A knowledge store saves enriched data to Azure Storage for use outside of search:

  • Table projections — Save structured data to Azure Table Storage (good for Power BI)
  • Object projections — Save JSON documents to Azure Blob Storage
  • File projections — Save binary files (images extracted during enrichment) to Blob Storage
💡
Exam Tip: The Shaper skill is commonly used to create the projection shapes for the knowledge store. It lets you define the exact JSON structure that will be projected to tables or objects.
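
The relationship between the Shaper skill and projections can be sketched as follows: the Shaper writes a shaped node into the enrichment tree, and each projection names that node as its source. The table name, container name, shape name (projectionShape), and connection string placeholder are hypothetical. The knowledgeStore object belongs inside the skillset definition.

```python
import json

# Shaper skill: gathers existing enrichments into one node for projection.
shaper_skill = {
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "context": "/document",
    "inputs": [
        {"name": "documentName", "source": "/document/metadata_storage_name"},
        {"name": "organizations", "source": "/document/organizations"},
    ],
    "outputs": [{"name": "output", "targetName": "projectionShape"}],
}

# Path where the Shaper's output lands in the enrichment tree.
shape_path = f'{shaper_skill["context"]}/{shaper_skill["outputs"][0]["targetName"]}'

knowledge_store = {
    "storageConnectionString": "<storage-connection-string>",  # placeholder
    "projections": [
        {   # Group 1: table projection, suited to Power BI.
            "tables": [{"tableName": "Documents",
                        "generatedKeyName": "DocumentId",
                        "source": shape_path}],
            "objects": [],
            "files": [],
        },
        {   # Group 2: object projection, one JSON blob per document.
            "tables": [],
            "objects": [{"storageContainer": "enriched-docs",
                         "source": shape_path}],
            "files": [],
        },
    ],
}

print(json.dumps({"skills": [shaper_skill],
                  "knowledgeStore": knowledge_store}, indent=2))
```

Keeping tables and objects in separate projection groups means each group is written independently from the same shape.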

Vector Search and Semantic Ranking

Modern search features in Azure AI Search:

  • Vector search — Store and query vector embeddings (from Azure OpenAI or other models) alongside traditional text
  • Hybrid search — Combine keyword (BM25) and vector search for better relevance
  • Semantic ranking — Re-rank search results using deep learning models for more relevant ordering
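  • Integrated vectorization — Automatically chunk content and generate embeddings during indexing using an embedding skill (for example, the Azure OpenAI Embedding skill), and convert query text to vectors at query time using a vectorizer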
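
Hybrid search has to merge two independently ranked result lists, and Reciprocal Rank Fusion (RRF) is the method used for that. The sketch below illustrates the idea only and is not the service's implementation: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is a commonly used value and should be treated as an assumption here, as are the sample document ids.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher positions (smaller rank) contribute larger scores.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_results = ["doc1", "doc2", "doc3"]  # e.g. BM25 order
vector_results = ["doc3", "doc1", "doc4"]   # e.g. nearest-neighbor order

fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # ['doc1', 'doc3', 'doc2', 'doc4']
```

Documents that rank well in both lists (doc1 and doc3) rise above documents that appear in only one, which is why hybrid results tend to reward agreement between keyword and vector matching.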

Practice Questions

📝
Question 1: You have thousands of PDF documents in Azure Blob Storage. You need to make them searchable and extract key phrases and organizations from each document. What components do you need to configure in Azure AI Search?

A. Data source, index, and indexer only
B. Data source, index, indexer, and a skillset with Key Phrase Extraction and Entity Recognition skills
C. Data source and a custom Web API skill only
D. Index and a knowledge store only
Answer: B. You need all four components: a data source pointing to Blob Storage, a skillset with Key Phrase Extraction and Entity Recognition built-in skills, an index with fields for the enriched data, and an indexer connecting them all together with appropriate output field mappings.
📝
Question 2: During indexing, you need to call your company's proprietary classification model hosted on an Azure Function to tag each document with a product category. Which skill type should you use?

A. Built-in Entity Recognition skill
B. Custom Web API Skill pointing to the Azure Function
C. Shaper skill
D. Document Extraction skill
Answer: B. The Custom Web API Skill calls an external REST endpoint (Azure Function, App Service, or any web API) during the enrichment pipeline. It sends document content to your classification model and receives the product category back as enriched output.
📝
Question 3: You want to use the enriched data from Azure AI Search in Power BI dashboards. What should you configure?

A. Export the search index to a CSV file
B. Configure a knowledge store with table projections
C. Connect Power BI directly to the search index
D. Configure a knowledge store with file projections
Answer: B. Table projections in the knowledge store save structured enriched data to Azure Table Storage, which Power BI can connect to directly. Object projections save JSON to Blob Storage. File projections are for binary files like extracted images.
📝
Question 4: Your search index contains both text fields and vector embedding fields. You want to combine keyword matching with semantic similarity for the best search results. What search approach should you use?

A. Full-text search only
B. Vector search only
C. Hybrid search combining keyword and vector search
D. Fuzzy search with synonyms
Answer: C. Hybrid search runs BM25 keyword matching and vector similarity search in the same request and merges the two ranked lists using Reciprocal Rank Fusion (RRF). It often gives better relevance than either method alone, because keyword search handles exact matches while vector search captures semantic meaning.
📝
Question 5: Your indexer is configured to run every 2 hours. You notice it reprocesses all documents on every run instead of only new or changed documents. What should you configure?

A. Enable incremental enrichment with a cache
B. Decrease the indexer schedule interval
C. Add a change detection policy to the data source
D. Use a different data source type
Answer: C. A change detection policy (high-water mark or SQL integrated change tracking) on the data source tells the indexer how to identify new and modified documents. Blob Storage indexers detect changes automatically from each blob's last-modified timestamp (surfaced as metadata_storage_last_modified), so full reprocessing on every run usually points to a source type, such as Azure SQL, that needs an explicit policy in its data source definition.

Key Takeaways

  • The enrichment pipeline flows: Data Source → Indexer → Skillset → Index + Knowledge Store.
  • Custom Web API Skills let you call Azure Functions or external REST APIs during indexing.
  • Knowledge store projections: tables (Power BI), objects (JSON), files (images).
  • The Shaper skill creates projection shapes for the knowledge store.
  • Hybrid search (keyword + vector) often gives better relevance than either method alone.
  • Output field mappings connect skillset outputs to index fields.