AI-901 Complete Study Guide

Stat	Value
Domains	2
Questions	40–60
Pass Score	700
Duration	45m
Wrong Penalty	
Cert Expiry	∞

Domain 1 Identify AI Concepts & Capabilities — 40–45%

Your compliance background covers ethical AI. Just learn Microsoft's specific six-principle framework and how each principle maps to Azure features.

Principle	What it means	Azure example
Fairness	AI should treat all people equitably	Sentiment analysis must not penalise dialects; bias testing before deployment
Reliability & Safety	AI should perform reliably and safely	Check `result.reason` before parsing; handle failures gracefully
Privacy & Security	AI should be secure and respect privacy	Never hardcode API keys; use `DefaultAzureCredential` or env vars
Inclusiveness	AI should empower everyone	STT trained on diverse accents, speech impediments, background noise
Transparency	AI should be understandable	Disclose when users are talking to AI (neural voice), not a human
Accountability	People should be accountable for AI	Log AI decisions for audit; human oversight of high-stakes outputs

Memorise all six names. Questions will describe a scenario and ask which principle applies. "A TTS voice is so realistic customers think it's human" → Transparency.

How generative AI models work

LLMs predict the next token based on probability distributions. They're trained on massive text corpora, then fine-tuned. They're stateless — full conversation history must be sent with every API call.
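The stateless point can be sketched without any SDK. In this minimal illustration, `fake_model` is a hypothetical stand-in for a chat completion call and `Conversation` is an invented helper, not a Foundry class; the only thing it demonstrates is that the client keeps the history and resends all of it on each turn.

```python
from typing import Callable

Message = dict[str, str]


def fake_model(messages: list[Message]) -> str:
    """Hypothetical stand-in for a chat completion call: it only knows what it is sent."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return f"I can see {len(user_turns)} user message(s); the latest is: {user_turns[-1]!r}"


class Conversation:
    """Keeps the history on the client side and resends it on every call."""

    def __init__(self, system_prompt: str, model: Callable[[list[Message]], str]) -> None:
        self.messages: list[Message] = [{"role": "system", "content": system_prompt}]
        self.model = model

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = self.model(self.messages)  # the full history goes out each time
        self.messages.append({"role": "assistant", "content": reply})
        return reply


chat = Conversation("You are a helpful assistant", fake_model)
first = chat.ask("My name is Sam.")
second = chat.ask("What is my name?")
print(second)
```

If the client sent only the latest user message, the model would have no way to answer the second question, because nothing is remembered on the model side.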

Key configuration parameters

Parameter	What it controls	Exam gotcha
`temperature`	Randomness of output. 0 = deterministic, 1+ = creative	Temperature 0 ≠ accurate. It means same input → same output, NOT that the output is correct
`top_p`	Nucleus sampling: sample only from the smallest set of top tokens whose cumulative probability reaches top_p	Lower top_p = more focused. Usually adjust one of temperature/top_p, not both
`max_tokens`	Maximum length of the generated response	Output tokens share the context window with the input. "Context length exceeded" = prompt + response tokens exceed the model's limit
`frequency_penalty`	Penalises tokens proportionally to how often they've appeared	Reduces repetition
`presence_penalty`	Penalises tokens for appearing at all (flat penalty)	Encourages new topics
`stop`	Sequences that halt generation	Useful for structured output (lists, dialogs)
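A toy sketch of what `temperature` and `top_p` do to a next-token distribution may make the table easier to remember. The logits below are invented, and the functions are a simplified illustration of the general technique, not the implementation used by any particular model or service.

```python
import math


def softmax_with_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches top_p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise the survivors


# Invented scores for the next token after "The capital of France is"
logits = {"Paris": 4.0, "Lyon": 2.0, "Rome": 1.0, "banana": -1.0}

greedy = softmax_with_temperature(logits, 0)
cold = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.5)
focused = top_p_filter(softmax_with_temperature(logits, 1.0), 0.9)

print("temperature 0.2:", {t: round(p, 3) for t, p in cold.items()})
print("temperature 1.5:", {t: round(p, 3) for t, p in warm.items()})
print("top_p 0.9 keeps:", sorted(focused))
```

Note that nothing here checks whether "Paris" is correct. A low temperature only concentrates probability on whatever the model already scores highest, which is why temperature 0 means repeatable rather than accurate.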

Deployment options in Foundry

Serverless (pay-per-token)

No reserved GPUs. Pay only for tokens consumed. Fast to start. Best for variable/low traffic.

Dedicated endpoint

Reserved compute. Predictable latency. Better for high-volume, production workloads.
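The "variable/low traffic" versus "high-volume" rule of thumb is a break-even calculation. The sketch below uses made-up prices purely for illustration; `PRICE_PER_1K_TOKENS` and `DEDICATED_MONTHLY_COST` are hypothetical values, and real pricing varies by model and region.

```python
def serverless_monthly_cost(tokens_per_month: float, price_per_1k_tokens: float) -> float:
    """Pay-per-token: cost scales linearly with usage."""
    return tokens_per_month / 1000 * price_per_1k_tokens


def break_even_tokens(dedicated_monthly_cost: float, price_per_1k_tokens: float) -> float:
    """Monthly token volume at which serverless costs the same as reserved compute."""
    return dedicated_monthly_cost / price_per_1k_tokens * 1000


# Hypothetical prices, for illustration only.
PRICE_PER_1K_TOKENS = 0.002
DEDICATED_MONTHLY_COST = 1500.0

threshold = break_even_tokens(DEDICATED_MONTHLY_COST, PRICE_PER_1K_TOKENS)
print(f"Break-even at roughly {threshold:,.0f} tokens per month")

for tokens in (10_000_000, 2_000_000_000):
    serverless = serverless_monthly_cost(tokens, PRICE_PER_1K_TOKENS)
    cheaper = "serverless" if serverless < DEDICATED_MONTHLY_COST else "dedicated"
    print(f"{tokens:>13,} tokens/month: serverless ${serverless:,.2f} -> {cheaper} is cheaper")
```

Cost is only half of the decision: a dedicated endpoint also buys predictable latency, which can justify it below the break-even volume.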

System prompt vs User prompt

Prompt type	Who sets it	Purpose
System prompt	Developer	Defines persona, rules, constraints, output format. Applied globally before any user input
User prompt	End user	The actual question or task

If asked where to put behavioural instructions ("respond formally", "only answer about policy"), the answer is always the system prompt — not the user prompt, not deployment settings.

Choosing an appropriate model

Need	Model type
Text generation, chat, reasoning	GPT-4o, GPT-4o-mini, Meta Llama
Multimodal (text + image + audio)	GPT-4o, phi-4-multimodal-instruct
Image generation	DALL-E 3, GPT-image models
Embeddings (for RAG/search)	text-embedding-ada-002, text-embedding-3-small
Lightweight/fast/cheap	GPT-4o-mini, Phi models
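To see why embedding models sit in their own row, here is a small sketch of how embeddings are typically used for search: texts become vectors, and the closest vector to the query wins. The three-dimensional vectors are toy values invented for this example; real embedding models return vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (hypothetical values).
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "office locations": [0.0, 0.2, 0.9],
}
# Pretend embedding of the query "how do I get my money back?"
query_vector = [0.8, 0.2, 0.1]

ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query_vector, documents[name]),
    reverse=True,
)
best_match = ranked[0]
print("Best match:", best_match)
```

In a RAG setup, the best-matching passages are then placed into the prompt of a chat model, which is a separate model from the one that produced the embeddings.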

The exam gives you a scenario and asks: which AI workload is this? Know these categories:

Workload	Scenario signal	Not this
Generative AI	Creating content — writing, summarising, generating images	If it's taking actions → agentic
Agentic AI	Taking actions — booking, emailing, searching, creating tickets, multi-step reasoning	If it's only creating content → generative
Text analysis	Extracting structured info from text — sentiment, entities, keywords	If extracting from images/PDFs → information extraction
Speech	Converting audio↔text, voice interaction	If analysing what was said → text analysis
Computer vision	"What is in this image?" — classification, detection, OCR	If extracting structured fields from a doc → information extraction
Information extraction	Extracting structured fields from documents, forms, images, audio, video	If just reading text in an image → OCR/vision

Key distinction tested: OCR (reads text from image) vs Content Understanding (extracts structured fields). "Scanned PDF" + "named structured fields" = information extraction, not computer vision.

Text analysis techniques

Technique	What it does
Keyword Extraction	Most salient terms for categorisation & search
Entity Detection (NER)	Classifies nouns → orgs, people, locations, money, dates
Sentiment Analysis	Positive / negative / neutral + confidence score
Summarisation	Extractive = picks existing sentences. Abstractive = generates new text
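The extractive versus abstractive distinction is easier to hold onto with a concrete case. The sketch below is a deliberately naive extractive summariser that scores sentences by word frequency; it is an illustration of the idea, not how Azure implements summarisation. Its output is always a sentence copied from the input, whereas an abstractive summary would be newly generated text from a generative model.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 1) -> str:
    """Return the highest-scoring original sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore very short words.
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> int:
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in top)


report = (
    "The invoice portal was down on Monday. "
    "Customers could not download invoices from the portal. "
    "Support restored the portal and invoices by noon."
)
summary = extractive_summary(report)
print(summary)
```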

Speech capabilities

Speech-to-Text (STT)

Audio → text. Real-time or batch. Speaker diarization: identifies who said what.

Text-to-Speech (TTS)

Text → audio. Neural voices. SSML: XML markup for pitch, pace, pauses, multi-speaker.
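As a rough illustration of what SSML looks like, the sketch below builds a small SSML document with the Python standard library. `build_ssml` is a hypothetical helper written for this guide, not part of the Speech SDK; the `prosody` and `break` elements and the namespace come from the W3C SSML specification.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(voice: str, text: str, rate: str = "medium",
               pitch: str = "medium", pause_ms: int = 0) -> str:
    """Build a one-voice SSML document with pace, pitch, and an optional pause."""
    ET.register_namespace("", SSML_NS)  # emit xmlns="..." without a prefix
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": "en-US"},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    prosody = ET.SubElement(voice_el, f"{{{SSML_NS}}}prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text
    if pause_ms > 0:
        ET.SubElement(voice_el, f"{{{SSML_NS}}}break", {"time": f"{pause_ms}ms"})
    return ET.tostring(speak, encoding="unicode")


ssml_doc = build_ssml("en-US-AriaNeural", "Welcome!", rate="slow", pitch="high", pause_ms=500)
print(ssml_doc)
```

Building the markup with an XML library rather than string concatenation means user-supplied text is escaped automatically.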

Computer vision capabilities

Capability	What it does
Image Classification	What is this image? → "cat", "dog", "car"
Object Detection	Where are objects? → bounding boxes + labels
OCR	Extract printed/handwritten text from images
Face Detection	Locate faces, estimate attributes (not identification)

Information extraction techniques

Azure Content Understanding extracts structured data from documents, forms, images, audio, and video. For AI-901, treat it as the current answer for structured extraction, in place of the legacy Form Recognizer / Document Intelligence naming.

ML fundamentals (light coverage)

Type	What it does	Example
Supervised	Learns from labelled data	Classification (spam/not spam), Regression (predict price)
Unsupervised	Finds patterns in unlabelled data	Clustering (customer segments)
Classification	Predicts a category	Is this email spam?
Regression	Predicts a number	What will the house sell for?
Clustering	Groups similar items	Customer segmentation
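The three task types in the table can be shown on toy data in plain Python. The numbers are invented and the algorithms are the simplest possible versions (a least-squares line, a midpoint threshold, and two-cluster k-means on one feature); they are a sketch of the ideas, not what a production library does.

```python
from statistics import mean


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Regression (supervised): least-squares slope and intercept from labelled pairs."""
    mean_x, mean_y = mean(xs), mean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def fit_threshold(values: list[float], labels: list[str], positive: str) -> float:
    """Classification (supervised): midpoint between the two labelled class means."""
    pos = mean([v for v, lab in zip(values, labels) if lab == positive])
    neg = mean([v for v, lab in zip(values, labels) if lab != positive])
    return (pos + neg) / 2


def two_means(points: list[float], iterations: int = 10) -> tuple[list[float], list[float]]:
    """Clustering (unsupervised): split unlabelled points into two groups."""
    low_centre, high_centre = min(points), max(points)
    low: list[float] = []
    high: list[float] = []
    for _ in range(iterations):
        low = [p for p in points if abs(p - low_centre) <= abs(p - high_centre)]
        high = [p for p in points if abs(p - low_centre) > abs(p - high_centre)]
        if not high:  # all points identical, nothing to split
            break
        low_centre, high_centre = mean(low), mean(high)
    return low, high


# Regression: predict a number (house price from size).
slope, intercept = fit_line([50, 80, 100, 120], [150, 240, 300, 360])
predicted_price = slope * 90 + intercept

# Classification: predict a category (spam from number of links).
threshold = fit_threshold([0, 1, 7, 9], ["ham", "ham", "spam", "spam"], positive="spam")
label = "spam" if 6 > threshold else "ham"

# Clustering: no labels at all, just monthly spend per customer.
low_spenders, high_spenders = two_means([10, 12, 11, 95, 100, 98])

print(predicted_price, label, low_spenders, high_spenders)
```

The first two functions need labelled examples to learn from; the third is handed raw values only, which is the supervised versus unsupervised distinction in the table.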

Domain 2 Implement AI Solutions Using Foundry — 55–60%

Foundry (formerly Azure AI Studio) is a unified PaaS ecosystem at ai.azure.com.

Foundry Hub → Project → Deployed Models → Endpoints

Component	Role
Hub	Top-level admin boundary — security, networking, billing
Project	Workspace — models, vector indexes, datasets, agents
Connections	Authenticated links to Azure Storage, AI Search, Speech, etc.
Endpoint	`https://{name}.services.ai.azure.com/api/projects/{project}`
Model Catalog	Browse & deploy OpenAI, Meta Llama, Phi, Mistral models

Foundry ≠ Azure OpenAI only. Foundry includes ALL Azure AI services — OpenAI models, Speech, Vision, Content Understanding, etc.

Terminology renames — don't get caught

Legacy (wrong on exam)	Current (correct)
Azure AI Studio	Microsoft Foundry
Azure AD	Microsoft Entra ID
Azure Cognitive Services	Azure AI services
Form Recognizer	Content Understanding (via Document Intelligence)
LUIS	Azure AI Language CLU
Assistants API	Responses API (Agents v2)

The table below is the highest-value thing to memorise. The exam swaps SDK responsibilities as distractors.

Package	Purpose	Key classes
`azure-ai-projects`	Foundry orchestration — agents, tools, vector stores, evaluators	`AIProjectClient` → `.agents`, `.evaluators`, `.vector_stores`
`azure-ai-inference`	Chat completions, multimodal input (text + image + audio)	`ChatCompletionsClient`, `InputAudio`, `AudioContentItem`, `ImageContentItem`
`azure-cognitiveservices-speech`	Dedicated STT and TTS via Azure Speech	`SpeechConfig`, `SpeechRecognizer`, `SpeechSynthesizer`
`azure-ai-contentunderstanding`	Structured extraction from docs/images (replaces Form Recognizer)	Analyzer config → structured objects
`azure-identity`	Authentication across all SDKs (imported as `azure.identity`)	`DefaultAzureCredential()`

Distractors will use azure-ai-inference for transcription or azure-cognitiveservices-speech for entity extraction. Know which SDK handles which job — they don't cross over.

You build multi-agent systems in Pydantic AI. AI-901 only tests single-agent understanding — simpler than what you already do.

Agent definition (testable)

An agent in Foundry = three components: Model, Instructions, Tools.

Model = from catalog (GPT-4o etc). Instructions = system prompt defining goals/behaviour. Tools = FileSearchTool, CodeInterpreter, external APIs.

Distractors: "Endpoint + API key + deployment name" (auth properties, not agent components). "Trigger + action + condition" (Power Automate, not agents). "Embedding + retriever + generator" (RAG pipeline, not agent definition).
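One way to fix the three components in memory is to write them down as a data structure. `AgentDefinition` below is a hypothetical class invented for this guide, not a type from the Foundry SDK; it only mirrors the model + instructions + tools shape described above.

```python
from dataclasses import dataclass, field


@dataclass
class AgentDefinition:
    """Hypothetical container mirroring the three testable agent components."""
    model: str                                       # deployed model from the catalog
    instructions: str                                # system prompt: goals and behaviour
    tools: list[str] = field(default_factory=list)   # capabilities the agent may call

    def component_names(self) -> list[str]:
        return ["model", "instructions", "tools"]


policy_agent = AgentDefinition(
    model="gpt-4o",
    instructions="Answer only questions about company policy.",
    tools=["FileSearchTool", "CodeInterpreter"],
)
print(policy_agent)
```

Notice what is absent: endpoint, API key, and deployment name describe how a client connects, so they do not appear in the definition of the agent itself.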

Lightweight chat client SDK pattern

# 1. Auth
import os

from azure.ai.inference import ChatCompletionsClient
from azure.identity import DefaultAzureCredential

# Endpoint comes from an environment variable (the name is illustrative), never hardcoded
endpoint = os.environ["AZURE_AI_ENDPOINT"]
client = ChatCompletionsClient(endpoint, DefaultAzureCredential())

# 2. Send messages (system prompt first, then the user prompt)
response = client.complete(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)

Agent with tools (FileSearchTool + RAG)

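# Sketch against the azure-ai-agents 1.x API exposed as project_client.agents
# (project_client is an AIProjectClient); method names vary between SDK versions
from azure.ai.agents.models import FilePurpose, FileSearchTool

# Upload a document (file name is illustrative), then create a vector store that indexes it
file = project_client.agents.files.upload_and_poll(
    file_path="policy.pdf", purpose=FilePurpose.AGENTS)
vector_store = project_client.agents.vector_stores.create_and_poll(
    file_ids=[file.id], name="policy-docs")

# Bind FileSearchTool to the vector store
tool = FileSearchTool(vector_store_ids=[vector_store.id])

# Pass tool.definitions and tool.resources when creating the agent;
# the agent then uses RAG to search the docs autonomously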

The exam explicitly tests: creating agents in the Foundry portal, deploying models from the catalog, testing prompts in the chat playground. You should have done this at least once at ai.azure.com before exam day.

In AI-901, text analysis is done via agents with constrained system prompts, not dedicated text analytics APIs:

AIProjectClient → .agents → System prompt constrains output → JSON entities out

For document-scale analysis, bind a FileSearchTool with a vector store. For structured extraction from scanned PDFs/forms, use Content Understanding (azure-ai-contentunderstanding).
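The "system prompt constrains output → JSON entities out" flow can be sketched without calling a model. In the example below, `SYSTEM_PROMPT` is one possible constraining prompt and `sample_reply` is a hypothetical agent response; both are invented for illustration. The part worth noticing is that the reply is validated rather than trusted.

```python
import json

# One possible constraining system prompt (illustrative wording).
SYSTEM_PROMPT = (
    "You are an entity extraction service. "
    'Reply with JSON only, in the form {"entities": [{"text": str, "category": str}]}. '
    "Allowed categories: Person, Organization, Location, Date, Money."
)
ALLOWED_CATEGORIES = {"Person", "Organization", "Location", "Date", "Money"}


def parse_entities(raw_reply: str) -> list[dict[str, str]]:
    """Validate the model reply and keep only well-formed, allowed entities."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError("Model did not return valid JSON") from exc
    entities = payload.get("entities") if isinstance(payload, dict) else None
    if not isinstance(entities, list):
        raise ValueError("Reply is missing the 'entities' list")
    cleaned = []
    for item in entities:
        if (isinstance(item, dict)
                and isinstance(item.get("text"), str)
                and item.get("category") in ALLOWED_CATEGORIES):
            cleaned.append({"text": item["text"], "category": item["category"]})
    return cleaned


# Hypothetical reply, standing in for what an agent might return.
sample_reply = (
    '{"entities": ['
    '{"text": "Contoso", "category": "Organization"}, '
    '{"text": "soon", "category": "Vibe"}]}'
)
entities = parse_entities(sample_reply)
print(entities)
```

A system prompt makes well-formed JSON likely, not guaranteed, so checking the reply before acting on it is the Reliability & Safety principle applied to text analysis.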

Legacy pipeline (wrong approach on exam)

STT model → NLP model → TTS model

High latency. Strips emotional prosody — model only sees sterile text.

Modern multimodal (correct approach)

Raw audio/image bytes → gpt-4o / phi-4-multimodal → Response

Lower latency: one model call instead of three. The model picks up tone, sarcasm, and urgency directly from the audio.

If a question asks how to preserve emotional context from a user's voice, the answer is always multimodal (not STT→NLP→TTS). Distractor options will suggest transcribing first — that's the trap.

Do NOT manually Base64-encode audio into a custom JSON dict. The SDK handles serialization. Manual encoding → "Invalid Input" error. This is a heavily tested distractor.

Correct class chain

raw bytes → InputAudio → AudioContentItem → UserMessage → .complete()

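from azure.ai.inference.models import (
    AudioContentFormat, AudioContentItem, InputAudio, TextContentItem, UserMessage)

# Load the audio through the SDK helper: it reads the file's bytes and handles
# the encoding, so there is no manual Base64 step in your code
audio = InputAudio.load(audio_file="query.wav", audio_format=AudioContentFormat.WAV)
audio_item = AudioContentItem(input_audio=audio)

# Build message with text + audio
msg = UserMessage(content=[
    TextContentItem(text="Analyze the tone"),
    audio_item
])

# client is a ChatCompletionsClient, created as in the chat client pattern
response = client.complete(model="gpt-4o", messages=[msg])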

Dedicated Speech services are still tested for enterprise STT/TTS, call centres, and SSML control.

Speech-to-Text

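import os

from azure.cognitiveservices.speech import ResultReason, SpeechConfig, SpeechRecognizer

config = SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"]
)
recognizer = SpeechRecognizer(speech_config=config)

# recognize_once_async() returns a future immediately; .get() waits for the result
result = recognizer.recognize_once_async().get()

# MUST check reason before text
if result.reason == ResultReason.RecognizedSpeech:
    print(result.text)
elif result.reason == ResultReason.NoMatch:
    print("No speech could be recognised")
elif result.reason == ResultReason.Canceled:
    print("Recognition cancelled:", result.cancellation_details.reason)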

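recognize_once_async() is async: the call returns a future without blocking the calling thread, and .get() then waits for the result. The exam tests whether you know the call itself doesn't block. Always check result.reason before result.text.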

Text-to-Speech + SSML

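from azure.cognitiveservices.speech import SpeechSynthesisOutputFormat, SpeechSynthesizer

# Set output quality (config is the SpeechConfig from the Speech-to-Text example)
config.set_speech_synthesis_output_format(
    SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm)
# Set voice
config.speech_synthesis_voice_name = "en-GB-RyanNeural"

synthesizer = SpeechSynthesizer(speech_config=config)

# SSML for multi-speaker; the speak element needs version, xmlns, and xml:lang
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-AriaNeural">Welcome!</voice>
  <voice name="en-GB-RyanNeural">How can I help?</voice>
</speak>"""

synthesizer.speak_ssml_async(ssml).get()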

SpeechSynthesisOutputFormat is for TTS audio quality ONLY. Distractors will use it as a "filter" for analytics or sentiment — it's not an analytics tool.

Interpreting visual input (multimodal)

Same pattern as audio — use azure-ai-inference to send images to a multimodal model:

Image bytes or URL → ImageContentItem → UserMessage → .complete()

Models: gpt-4o, phi-4-multimodal-instruct — same models that handle audio also handle images natively.

Generating images

Use generative image models (DALL-E 3 / GPT-image) via Foundry. Provide a text prompt, receive generated image. Common use: marketing visuals, product mockups, creative content.

Azure AI Vision capabilities (conceptual)

Capability	What it does
Image classification	Labels the whole image ("beach", "office")
Object detection	Locates objects with bounding boxes
OCR	Extracts text from images
Face detection	Locates faces, estimates age/emotion (not identification)
Spatial analysis	Analyses movement/position of people in video

Vision questions often test the boundary between "what's in this image" (computer vision) and "extract these specific fields from this document" (information extraction / Content Understanding).

Azure Content Understanding is a Foundry Tool for extracting structured information. It handles:

Input	What it extracts
Documents & Forms	Key-value pairs, tables, signatures from PDFs, invoices, contracts
Images	Structured fields from photos of receipts, IDs, labels
Audio	Transcripts + structured extraction from recordings
Video	Scene analysis, transcript extraction, event detection

SDK: azure-ai-contentunderstanding. Define an analyzer config → pass binary data → iterate structured results.

Content Understanding is a centrepiece of AI-901's information extraction section. Run it against at least one PDF and one image before exam day so the SDK pattern is familiar.

Content Understanding vs OCR vs Vision

Task	Tool
"Read the text in this image"	OCR (Azure AI Vision)
"What objects are in this image?"	Computer Vision
"Extract invoice number, date, total from this scanned PDF"	Content Understanding
"Transcribe this audio and extract speaker names and dates mentioned"	Content Understanding

Principle	Text	Speech	Vision
Fairness	Sentiment must not penalise dialects	—	Face detection accuracy across skin tones
Inclusiveness	Support multiple languages	STT trained on diverse accents, impediments	Accessibility for visually impaired users
Transparency	Disclose AI-generated content	Disclose neural voices are not human	Disclose AI-generated images
Privacy	Secure handling of analysed text	Never hardcode speech keys	Face detection ≠ identification
Reliability	Check results before acting	Handle silent audio, auth failures	Handle low-confidence predictions
Accountability	Log analysis for audit	Log synthesised interactions	Human review of high-stakes decisions

Trap	Truth
Temperature 0 = accurate answers	Temperature 0 = deterministic (same input → same output). Model can still hallucinate. For accuracy, use RAG or better prompts
Models remember previous conversations	Models are stateless. Full history must be sent with every API call
Manually Base64-encode audio for multimodal	Use SDK classes (`InputAudio` → `AudioContentItem`). SDK handles encoding
SpeechSynthesisOutputFormat extracts sentiment	It's a TTS output quality setting only
GPT-3.5 can ingest raw audio	GPT-3.5 is text-only. Use GPT-4o or phi-4-multimodal for audio
SSML is for text formatting	SSML is for speech synthesis control (pitch, pace, pauses, voices)
OCR = information extraction	OCR reads text. Content Understanding extracts structured fields
Foundry = Azure OpenAI	Foundry includes ALL Azure AI services, not just OpenAI
Behavioural rules go in user prompt	System prompt = developer rules. User prompt = end user questions
Agent = endpoint + API key	Agent = model + instructions + tools

Practice Questions

Q1: A developer sends a WAV file to a multimodal model but gets "Invalid Input". They're manually Base64-encoding audio into a custom JSON dict. What's the fix?

A. Convert to MP3 — WAV isn't supported.

B. Read as raw bytes → InputAudio(format=AudioContentFormat.WAV) → AudioContentItem → UserMessage.

C. Upload Base64 string as a persistent dataset via azure-ai-projects, reference by GUID.

D. Use azure-cognitiveservices-speech to transcribe first, then pass text to the model.

B. The SDK handles serialization. Manual encoding causes schema errors. A is wrong — WAV is supported. C overcomplicates dataset tracking. D defeats multimodal purpose by stripping prosody.

Q2: A call centre needs to batch-transcribe recorded calls, then analyse transcripts for sentiment and entity extraction. Which SDK sequence?

A. azure-ai-inference for transcription → azure-cognitiveservices-speech for entity extraction.

B. azure-cognitiveservices-speech for STT → azure-ai-projects agent for text analysis.

C. Upload to Blob Storage → apply SpeechSynthesisOutputFormat filter to extract sentiment.

D. azure-ai-inference with GPT-3.5 to ingest raw audio and return SSML-highlighted entities.

B. Speech SDK handles STT. Projects SDK handles text analysis via agents. A reverses responsibilities. C misuses SpeechSynthesisOutputFormat (it's TTS quality only). D fails — GPT-3.5 isn't multimodal, SSML isn't for text formatting.

Q3: A chatbot needs to book meetings, search knowledge bases, and send emails autonomously. Which workload category?

A. Generative AI — it's generating responses.

B. Agentic AI — it's taking actions autonomously.

C. Text analysis — it's processing language.

D. Information extraction — it's pulling data from sources.

B. Key distinction: if the AI is taking actions (booking, emailing, searching), it's agentic. If it's only creating content, it's generative.

Q4: A company processes thousands of scanned invoices and needs to extract invoice numbers, dates, and totals into structured JSON. Which Azure service?

A. Azure AI Vision (OCR) — it reads text from images.

B. Azure Content Understanding — it extracts structured fields from documents.

C. Azure AI Language (NER) — it detects entities in text.

D. GPT-4o with a system prompt to extract fields.

B. "Scanned PDF" + "named structured fields" = information extraction via Content Understanding. OCR just reads text. NER works on already-extracted text. GPT-4o could work but Content Understanding is the purpose-built tool at scale.

Q5: A developer sets temperature to 0 to ensure their medical Q&A bot gives correct answers. Is this sufficient?

A. Yes — temperature 0 ensures accurate, factual output.

B. No — temperature 0 makes output deterministic but the model can still hallucinate. Use RAG for accuracy.

C. No — temperature 0 is invalid; minimum is 0.1.

D. Yes — combined with max_tokens limit, it guarantees factual responses.

B. Temperature 0 = same input → same output. It does NOT mean correct. The model can still hallucinate. For accuracy in medical scenarios, use grounding via RAG (Retrieval-Augmented Generation) with verified sources.

Q6: A TTS neural voice sounds so realistic that customers believe they're speaking to a human. Which responsible AI principle is violated?

A. Fairness — not all customers have equal access.

B. Transparency — users must be informed they're interacting with AI.

C. Inclusiveness — the voice doesn't support all languages.

D. Reliability — the voice might fail mid-conversation.

B. Transparency requires that users know when they're interacting with AI, not a human. Realistic neural voices must include disclosure.

Q7: What are the three components of an agent in Microsoft Foundry Agent Service?

A. Endpoint, API key, deployment name.

B. Embedding, retriever, generator.

C. Model, instructions, tools.

D. Trigger, action, condition.

C. Agent = model (from catalog) + instructions (system prompt) + tools (FileSearch, CodeInterpreter, APIs). A describes auth properties. B describes a RAG pipeline. D describes Power Automate workflows.

Q8: A team wants to deploy Meta Llama 3 for a chatbot with no GPU budget, starting today, paying only for tokens consumed. Which deployment option?

A. Serverless (pay-per-token) deployment in Foundry.

B. Dedicated endpoint with reserved compute.

C. Download the model and host on Azure VMs.

D. Use Azure Machine Learning managed endpoints.

A. Serverless = no reserved GPUs, pay only for tokens, fast to deploy. Dedicated endpoint requires reserved compute (budget). Self-hosting on VMs is the opposite of "no GPU budget."

Study Resources

Resource	Link
Official study guide	learn.microsoft.com/.../ai-901
Free training: AI concepts path	learn.microsoft.com/.../ai-concepts
Free training: AI apps & agents	learn.microsoft.com/.../ai-apps-agents
Foundry Portal	ai.azure.com — explore before exam day
GitHub labs	MicrosoftLearning/mslearn-ai-fundamentals
Tim Warner study repo	timothywarner-org/ai901
Free practice: Tutorials Dojo (20 Qs)	tutorialsdojo.com
Free practice: A Guide to Cloud (265 Qs)	aguidetocloud.com
Free practice: Examinotion (5 worked Qs)	examinotion.com
