Implement Knowledge Mining (15-20%)
Learn how to use Azure AI Search (formerly Cognitive Search) to extract insights from large volumes of unstructured data using indexers, AI skillsets, custom skills, and knowledge stores.
Azure AI Search Architecture
Azure AI Search is a cloud search service that provides full-text search, AI enrichment, and vector search capabilities. The core components are:
- Data sources — Azure Blob Storage, Azure SQL, Cosmos DB, Azure Table Storage, or any data accessible via connectors
- Indexers — Automated crawlers that read data from sources and populate the search index
- Skillsets — AI enrichment pipelines that process data during indexing (OCR, NER, translation, etc.)
- Index — The searchable data store with fields, types, and attributes
- Knowledge store — Optional secondary output that saves enriched data to Azure Storage for downstream analytics
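To make the "Index" component more concrete, here is a minimal sketch of an index definition assembled as a Python dictionary. The field names are hypothetical and chosen only for illustration; the attribute names (`key`, `searchable`, `filterable`, `sortable`, `facetable`) and the `Edm.*` types follow the REST index schema. The `build_index` helper is not part of any SDK; it only checks that exactly one key field is present.

```python
import json


def build_index(name, fields):
    """Return an index definition, checking that exactly one key field exists."""
    keys = [f["name"] for f in fields if f.get("key")]
    if len(keys) != 1:
        raise ValueError(f"expected exactly one key field, found {len(keys)}")
    return {"name": name, "fields": fields}


# Hypothetical field names, for illustration only.
index = build_index("my-index", [
    {"name": "id", "type": "Edm.String", "key": True, "filterable": True},
    {"name": "documentName", "type": "Edm.String", "searchable": True, "sortable": True},
    {"name": "content", "type": "Edm.String", "searchable": True},
    {"name": "detectedOrganizations", "type": "Collection(Edm.String)",
     "searchable": True, "filterable": True, "facetable": True},
])

# The resulting JSON is the kind of body a "create index" request would carry.
print(json.dumps(index, indent=2))
```

Each attribute controls how a field can be used in queries, so a field that only needs to be displayed does not have to be marked searchable or filterable.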
Indexers and Data Sources
Indexers automate the data ingestion process:
- Scheduling — Run once, on a schedule (e.g., every 5 minutes), or on-demand via REST API
- Change detection — Indexers use a change detection policy (such as a high-water mark column) to process only new or modified documents, and a soft delete policy to remove deleted documents from the index
- Blob indexing — Supports PDF, DOCX, PPTX, XLSX, HTML, JSON, CSV, TXT, images (with OCR skill), and more
- Field mappings — Map source fields to index fields when names differ or transformations are needed
- Output field mappings — Map skillset output to index fields
// Indexer definition (REST API). The schedule interval is an ISO 8601 duration: PT2H runs every 2 hours.
// Remove these comment lines before sending; the request body must be plain JSON.
{
  "name": "my-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-index",
  "skillsetName": "my-skillset",
  "schedule": {
    "interval": "PT2H"
  },
  "fieldMappings": [
    { "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "documentName" }
  ],
  "outputFieldMappings": [
    { "sourceFieldName": "/document/organizations",
      "targetFieldName": "detectedOrganizations" }
  ]
}
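The high-water mark idea from the change detection bullet can be illustrated with a small local simulation. This is a sketch of the concept only, not the service's implementation: the document records, the `last_modified` key, and the `changed_since` helper are all assumptions made for the example.

```python
from datetime import datetime, timezone


def changed_since(documents, high_water_mark):
    """Return documents modified after the mark, plus the updated mark."""
    changed = [d for d in documents if d["last_modified"] > high_water_mark]
    # Advance the mark to the newest timestamp seen; keep it if nothing changed.
    new_mark = max((d["last_modified"] for d in changed), default=high_water_mark)
    return changed, new_mark


# Hypothetical source documents with a last-modified timestamp.
docs = [
    {"id": "a.pdf", "last_modified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "b.pdf", "last_modified": datetime(2024, 1, 3, tzinfo=timezone.utc)},
]

mark = datetime(2024, 1, 2, tzinfo=timezone.utc)

# First run: only b.pdf is newer than the stored mark.
first_run, mark = changed_since(docs, mark)
print([d["id"] for d in first_run])   # ['b.pdf']

# Second run with no new edits: nothing to process, mark unchanged.
second_run, mark = changed_since(docs, mark)
print([d["id"] for d in second_run])  # []
```

Because the mark only moves forward, each run touches just the documents that changed since the previous run, which is what keeps scheduled runs cheap.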
Built-in Skills
Azure AI Search includes built-in cognitive skills for AI enrichment:
| Category | Skills | Description |
|---|---|---|
| Vision | OCR, Image Analysis | Extract text from images; generate captions and tags |
| Language | Entity Recognition, Key Phrases, Sentiment, Language Detection, PII Detection | Extract entities, key phrases, and sentiment; detect language and PII in text |
| Translation | Text Translation | Translate text to target language during indexing |
| Utility | Shaper, Merge, Text Split, Conditional | Reshape data, merge fields, split large documents, conditional logic |
| Document | Document Extraction | Extract content and metadata from documents within a data source |
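As an illustration of how built-in skills are combined, the sketch below assembles a skillset with the Key Phrase Extraction and Entity Recognition skills as a Python dictionary. The skillset name and description are placeholders, and `output_paths` is a hypothetical helper (not an SDK function) that lists the enrichment-tree paths the skills would write to.

```python
import json

# A sketch of a skillset body; names and descriptions are placeholders.
skillset = {
    "name": "my-skillset",
    "description": "Key phrases and organizations from document text",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
            "context": "/document",
            "categories": ["Organization"],
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "organizations", "targetName": "organizations"}],
        },
    ],
}


def output_paths(definition):
    """List the enrichment-tree paths produced by each skill output."""
    return [
        f'{skill["context"]}/{output["targetName"]}'
        for skill in definition["skills"]
        for output in skill["outputs"]
    ]


print(output_paths(skillset))  # ['/document/keyPhrases', '/document/organizations']
print(json.dumps(skillset, indent=2))
```

Paths such as `/document/organizations` are what an indexer's output field mappings refer to when copying enriched values into index fields.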
Custom Skills
When built-in skills are not enough, you can create custom skills:
Custom Web API Skill
The most common custom skill type. It calls an external REST endpoint (typically an Azure Function) during indexing:
- Receives a batch of records from the enrichment pipeline
- Processes them (call an external ML model, apply business logic, query a database)
- Returns enriched records in the expected JSON format
// Custom Web API Skill definition
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "name": "custom-entity-lookup",
  "description": "Lookup entities in our product database",
  "uri": "https://my-function-app.azurewebsites.net/api/EntityLookup",
  "httpMethod": "POST",
  "timeout": "PT30S",
  "batchSize": 10,
  "context": "/document",
  "inputs": [
    { "name": "text", "source": "/document/content" }
  ],
  "outputs": [
    { "name": "productCodes", "targetName": "detectedProducts" }
  ]
}
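On the receiving side, the endpoint gets a `values` array of records and must return one result per `recordId`. The following is a minimal sketch of that request/response handling as a plain Python function, independent of any web framework. The `PRODUCT_CATALOG` lookup table and the `handle_skill_request` name are assumptions for illustration; a real function would replace the lookup with a database or model call.

```python
import json

# Hypothetical product catalog: keyword -> product code.
PRODUCT_CATALOG = {"widget": "P-100", "gadget": "P-200"}


def handle_skill_request(body):
    """Process a custom skill batch and return one result per input record."""
    results = []
    for record in body.get("values", []):
        result = {"recordId": record["recordId"], "data": {},
                  "errors": [], "warnings": []}
        text = record.get("data", {}).get("text")
        if not isinstance(text, str):
            # Report the problem for this record without failing the batch.
            result["errors"].append({"message": "Missing or non-string 'text' input."})
        else:
            lowered = text.lower()
            result["data"]["productCodes"] = sorted(
                code for word, code in PRODUCT_CATALOG.items() if word in lowered
            )
        results.append(result)
    return {"values": results}


request_body = {
    "values": [
        {"recordId": "1", "data": {"text": "The Widget ships with a gadget."}},
        {"recordId": "2", "data": {}},
    ]
}

response_body = handle_skill_request(request_body)
print(json.dumps(response_body, indent=2))
```

The output name `productCodes` matches the `outputs` entry in the skill definition above, and every `recordId` from the request is echoed back so the pipeline can match results to documents.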
Azure Machine Learning Skill
Call a deployed Azure ML model directly from the enrichment pipeline. Useful for running custom ML models (classification, NER, etc.) on indexed content.
Knowledge Store
A knowledge store saves enriched data to Azure Storage for use outside of search:
- Table projections — Save structured data to Azure Table Storage (good for Power BI)
- Object projections — Save JSON documents to Azure Blob Storage
- File projections — Save binary files (images extracted during enrichment) to Blob Storage
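The three projection types can be seen side by side in a sketch of the `knowledgeStore` section of a skillset. The connection string placeholder, table names, container names, and the `/document/tableprojection` and `/document/objectprojection` source paths are hypothetical; they assume a Shaper skill has already produced shapes at those paths. `projection_types` is an illustrative helper, not an SDK call.

```python
import json


def build_knowledge_store(connection_string):
    """Return a knowledgeStore definition with table, object, and file projections."""
    return {
        "storageConnectionString": connection_string,
        "projections": [
            {   # Group 1: tables, e.g. for Power BI.
                "tables": [
                    {"tableName": "Documents", "generatedKeyName": "DocumentId",
                     "source": "/document/tableprojection"},
                    {"tableName": "KeyPhrases", "generatedKeyName": "KeyPhraseId",
                     "source": "/document/tableprojection/keyPhrases/*"},
                ],
                "objects": [],
                "files": [],
            },
            {   # Group 2: JSON objects and extracted image files in Blob Storage.
                "tables": [],
                "objects": [
                    {"storageContainer": "enriched-docs",
                     "source": "/document/objectprojection"},
                ],
                "files": [
                    {"storageContainer": "extracted-images",
                     "source": "/document/normalized_images/*"},
                ],
            },
        ],
    }


def projection_types(knowledge_store):
    """Return the projection types that are actually used, in sorted order."""
    used = set()
    for group in knowledge_store["projections"]:
        used.update(kind for kind in ("tables", "objects", "files") if group[kind])
    return sorted(used)


store = build_knowledge_store("<storage-connection-string>")
print(projection_types(store))  # ['files', 'objects', 'tables']
print(json.dumps(store, indent=2))
```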
Vector Search and Semantic Ranking
Modern search features in Azure AI Search:
- Vector search — Store and query vector embeddings (from Azure OpenAI or other models) alongside traditional text
- Hybrid search — Combine keyword (BM25) and vector search for better relevance
- Semantic ranking — Re-rank search results using deep learning models for more relevant ordering
- Integrated vectorization — Automatically chunk content and generate embeddings during indexing with an embedding skill, and vectorize query text at query time with a vectorizer
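Hybrid search has to merge two separately ranked result lists. Azure AI Search documents Reciprocal Rank Fusion (RRF) as the method it uses for this; the sketch below is a simplified local illustration of RRF, not the service's code. The constant `k=60` is an assumption (a commonly used value), and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each hit scores 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) query and a vector query.
keyword_hits = ["doc1", "doc2", "doc3"]
vector_hits = ["doc3", "doc1", "doc4"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc1', 'doc3', 'doc2', 'doc4']
```

Documents that appear in both lists (here `doc1` and `doc3`) accumulate score from each, which is why they rise above documents found by only one method.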
Practice Questions
A. Data source, index, and indexer only
B. Data source, index, indexer, and a skillset with Key Phrase Extraction and Entity Recognition skills
C. Data source and a custom Web API skill only
D. Index and a knowledge store only
A. Built-in Entity Recognition skill
B. Custom Web API Skill pointing to the Azure Function
C. Shaper skill
D. Document Extraction skill
A. Export the search index to a CSV file
B. Configure a knowledge store with table projections
C. Connect Power BI directly to the search index
D. Configure a knowledge store with file projections
A. Full-text search only
B. Vector search only
C. Hybrid search combining keyword and vector search
D. Fuzzy search with synonyms
A. Enable incremental enrichment with a cache
B. Decrease the indexer schedule interval
C. Add a change detection policy to the data source
D. Use a different data source type
metadata_storage_last_modified field by default. If this is not working, verify that the data source configuration includes the correct change detection policy.
Key Takeaways
- The enrichment pipeline flows: Data Source → Indexer → Skillset → Index + Knowledge Store.
- Custom Web API Skills let you call Azure Functions or external REST APIs during indexing.
- Knowledge store projections: tables (Power BI), objects (JSON), files (images).
- The Shaper skill creates projection shapes for the knowledge store.
- Hybrid search (keyword + vector) typically gives better relevance than either method alone.
- Output field mappings connect skillset outputs to index fields.