Complete Exam Coverage · All Domains

AI-901 Study Guide

Exam: July 5 2026, Manchester Piccadilly · Pass: 700/1000 · 45 min · No wrong-answer penalty
Domains: 2
Questions: 40–60
Pass score: 700
Duration: 45 min
Wrong-answer penalty: No
Cert expiry: ∞
Domain 1: Identify AI Concepts & Capabilities (40–45%)
Your compliance background covers ethical AI. Just learn Microsoft's specific six-principle framework and how each principle maps to Azure features.
Principle | What it means | Azure example
Fairness | AI should treat all people equitably | Sentiment analysis must not penalise dialects; bias testing before deployment
Reliability & Safety | AI should perform reliably and safely | Check result.reason before parsing; handle failures gracefully
Privacy & Security | AI should be secure and respect privacy | Never hardcode API keys; use DefaultAzureCredential or env vars
Inclusiveness | AI should empower everyone | STT trained on diverse accents, speech impediments, background noise
Transparency | AI should be understandable | Disclose when users are talking to AI (neural voice), not a human
Accountability | People should be accountable for AI | Log AI decisions for audit; human oversight of high-stakes outputs
Memorise all six names. Questions will describe a scenario and ask which principle applies. "A TTS voice is so realistic customers think it's human" → Transparency.
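
The Privacy & Security row ("never hardcode API keys; use env vars") can be sketched as a fail-fast lookup. This is a minimal illustration, not an Azure API: the helper name `require_env` and the variable name `DEMO_API_KEY` are hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or fail loudly if it is unset."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Hypothetical variable, set here only so the sketch runs end to end.
os.environ.setdefault("DEMO_API_KEY", "not-a-real-key")
api_key = require_env("DEMO_API_KEY")
```

Failing at startup is usually preferable to discovering a missing secret on the first request.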

How generative AI models work

LLMs predict the next token from a probability distribution. They're trained on massive text corpora, then fine-tuned. They're stateless: the full conversation history must be sent with every API call.
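
Statelessness can be sketched as follows. `echo_model` is a hypothetical stand-in for a real chat model (it only reports what it was sent), and `ask` is an illustrative helper, not an SDK function.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def echo_model(messages: List[Message]) -> str:
    """Hypothetical stand-in for a chat model: it sees only what is passed in."""
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    return f"I can see {len(user_turns)} user message(s); the latest is: {user_turns[-1]}"


def ask(history: List[Message], user_text: str,
        model: Callable[[List[Message]], str] = echo_model) -> str:
    """Append the user turn, send the FULL history, then record the reply."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)
    history.append({"role": "assistant", "content": reply})
    return reply


history: List[Message] = [{"role": "system", "content": "You are a helpful assistant"}]
first = ask(history, "My name is Sam.")
second = ask(history, "What is my name?")
```

The model only "remembers" the first turn on the second call because the caller resent it; drop the history and that context is gone.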

Key configuration parameters

Parameter | What it controls | Exam gotcha
temperature | Randomness of output. 0 = deterministic, 1+ = creative | Temperature 0 ≠ accurate. It means same input → same output, NOT that the output is correct
top_p | Nucleus sampling: only consider tokens whose cumulative probability ≤ top_p | Lower top_p = more focused. Usually adjust one of temperature/top_p, not both
max_tokens | Maximum length of generated response | Shared with input context. "Context length exceeded" = too many tokens total
frequency_penalty | Penalises tokens proportionally to how often they've appeared | Reduces repetition
presence_penalty | Penalises tokens for appearing at all (flat penalty) | Encourages new topics
stop | Sequences that halt generation | Useful for structured output (lists, dialogs)
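
The sampling parameters in the table can be sketched on a toy three-token vocabulary. This is one common formulation, not any particular service's implementation: the logits are made up, temperature 0 is treated as greedy argmax, and top_p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold.

```python
import math


def apply_penalties(logits, counts, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logit of tokens that have already appeared in the output."""
    adjusted = {}
    for token, logit in logits.items():
        seen = counts.get(token, 0)
        flat = presence_penalty if seen > 0 else 0.0  # applied once, however often seen
        adjusted[token] = logit - seen * frequency_penalty - flat
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperature 0 is treated as greedy (argmax)."""
    if temperature <= 0:
        best = max(logits, key=logits.get)
        return {token: 1.0 if token == best else 0.0 for token in logits}
    scaled = {token: value / temperature for token, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in scaled.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    kept = {}
    cumulative = 0.0
    for token, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}  # renormalise


logits = {"the": 2.0, "a": 1.0, "zebra": 0.0}  # made-up scores
greedy = softmax_with_temperature(logits, 0)
warm = softmax_with_temperature(logits, 1.0)
hot = softmax_with_temperature(logits, 2.0)
nucleus = top_p_filter(warm, 0.9)
penalised = apply_penalties(logits, {"the": 3}, frequency_penalty=0.5, presence_penalty=0.2)
```

Raising the temperature flattens the distribution (`hot["the"]` is lower than `warm["the"]`), while `nucleus` drops the unlikely "zebra" entirely. Neither setting says anything about whether the most likely token is factually correct.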

Deployment options in Foundry

Serverless (pay-per-token)

No reserved GPUs. Pay only for tokens consumed. Fast to start. Best for variable/low traffic.

vs

Dedicated endpoint

Reserved compute. Predictable latency. Better for high-volume, production workloads.

System prompt vs User prompt

Prompt type | Who sets it | Purpose
System prompt | Developer | Defines persona, rules, constraints, output format. Applied globally before any user input
User prompt | End user | The actual question or task
If asked where to put behavioural instructions ("respond formally", "only answer about policy"), the answer is always the system prompt, not the user prompt and not deployment settings.

Choosing an appropriate model

Need | Model type
Text generation, chat, reasoning | GPT-4o, GPT-4o-mini, Meta Llama
Multimodal (text + image + audio) | GPT-4o, phi-4-multimodal-instruct
Image generation | DALL-E 3, GPT-image models
Embeddings (for RAG/search) | text-embedding-ada-002, text-embedding-3-small
Lightweight/fast/cheap | GPT-4o-mini, Phi models

The exam gives you a scenario and asks: which AI workload is this? Know these categories:

Workload | Scenario signal | Not this
Generative AI | Creating content: writing, summarising, generating images | If it's taking actions → agentic
Agentic AI | Taking actions: booking, emailing, searching, creating tickets, multi-step reasoning | If it's only creating content → generative
Text analysis | Extracting structured info from text: sentiment, entities, keywords | If extracting from images/PDFs → information extraction
Speech | Converting audio↔text, voice interaction | If analysing what was said → text analysis
Computer vision | "What is in this image?": classification, detection, OCR | If extracting structured fields from a doc → information extraction
Information extraction | Extracting structured fields from documents, forms, images, audio, video | If just reading text in an image → OCR/vision
Key distinction tested: OCR (reads text from image) vs Content Understanding (extracts structured fields). "Scanned PDF" + "named structured fields" = information extraction, not computer vision.
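
The OCR versus structured-fields distinction can be illustrated with a toy example. This is not how Content Understanding works internally: the invoice text is invented sample data and the regex-based `extract_fields` helper is hypothetical. It only shows the difference in output shape.

```python
import re
from typing import Dict, Optional

# What an OCR step hands back: plain text with no structure (invented sample).
ocr_text = """ACME Ltd
Invoice No: INV-1042
Date: 2026-03-14
Total: 1,250.00 GBP"""

FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Map raw text to named fields; a field that is not found becomes None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(ocr_text)
```

`ocr_text` is the "computer vision" answer (a string); `fields` is the "information extraction" answer (named values ready for a database or JSON payload).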

Text analysis techniques

Keyword Extraction: most salient terms for categorisation & search
Entity Detection (NER): classifies nouns → orgs, people, locations, money, dates
Sentiment Analysis: positive / negative / neutral + confidence score
Summarisation: extractive = picks existing sentences; abstractive = generates new text
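
Extractive summarisation can be sketched with a simple word-frequency scorer. This is a toy heuristic chosen for illustration (it favours longer sentences), not the method any Azure service uses; the review text is invented.

```python
import re
from collections import Counter
from typing import List

WORD = re.compile(r"[a-z']+")


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def extractive_summary(text: str, max_sentences: int = 1) -> List[str]:
    """Return the highest-scoring ORIGINAL sentences, kept in document order."""
    sentences = split_sentences(text)
    frequencies = Counter(WORD.findall(text.lower()))

    def score(index: int) -> int:
        return sum(frequencies[w] for w in WORD.findall(sentences[index].lower()))

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]


review = ("The battery life is great. "
          "The screen is sharp and the battery charges fast. "
          "Delivery was slow.")
summary = extractive_summary(review, max_sentences=1)
```

Every sentence in `summary` appears verbatim in `review`, which is the defining property of extractive output. An abstractive summary would require a generative model to write new text.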

Speech capabilities

Speech-to-Text (STT)

Audio → text. Real-time or batch. Speaker diarization: identifies who said what.

vs

Text-to-Speech (TTS)

Text → audio. Neural voices. SSML: XML markup for pitch, pace, pauses, multi-speaker.
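
The SSML point can be sketched by building a small multi-voice document and checking that it is well-formed XML. `build_ssml` is a hypothetical helper; the `speak`, `voice`, `prosody` and `break` elements are standard SSML, and the voice names are simply the ones used elsewhere in this guide.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(turns, lang="en-US", rate="medium", pause_ms=300):
    """Build an SSML string with one <voice> per (voice_name, text) turn."""
    parts = []
    for voice, text in turns:
        parts.append(
            f"<voice name={quoteattr(voice)}>"
            f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"  # pace
            f'<break time="{int(pause_ms)}ms"/>'                          # pause
            "</voice>"
        )
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        + "".join(parts)
        + "</speak>"
    )


ssml = build_ssml([
    ("en-US-AriaNeural", "Welcome!"),
    ("en-GB-George", "Terms & conditions apply."),
])
root = ET.fromstring(ssml)  # raises ParseError if the markup is malformed
```

Escaping the text matters: an unescaped `&` would make the document invalid before it ever reaches a synthesizer.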

Computer vision capabilities

Image Classification: what is this image? → "cat", "dog", "car"
Object Detection: where are objects? → bounding boxes + labels
OCR: extract printed/handwritten text from images
Face Detection: locate faces, estimate attributes (not identification)

Information extraction techniques

Azure Content Understanding extracts structured data from: documents, forms, images, audio, and video. It replaces legacy Form Recognizer / Document Intelligence.

ML fundamentals (light coverage)

Type | What it does | Example
Supervised | Learns from labelled data | Classification (spam/not spam), Regression (predict price)
Unsupervised | Finds patterns in unlabelled data | Clustering (customer segments)
Classification | Predicts a category | Is this email spam?
Regression | Predicts a number | What will the house sell for?
Clustering | Groups similar items | Customer segmentation
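
The three task types in the table can be contrasted with tiny pure-Python sketches. The data is invented and the algorithms (least-squares line, nearest centroid, two-cluster k-means on one feature) are picked only because they are short; the exam does not require implementing them.

```python
def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def nearest_centroid(labelled, value):
    """Classification: pick the label whose training mean is closest."""
    centroids = {label: sum(vals) / len(vals) for label, vals in labelled.items()}
    return min(centroids, key=lambda label: abs(value - centroids[label]))


def two_means(values, iterations=10):
    """Clustering: split unlabelled numbers into two groups (k-means, k=2)."""
    low, high = min(values), max(values)
    left, right = list(values), []
    for _ in range(iterations):
        left = [v for v in values if abs(v - low) <= abs(v - high)]
        right = [v for v in values if abs(v - low) > abs(v - high)]
        if not right:  # all values identical: nothing to split
            break
        low, high = sum(left) / len(left), sum(right) / len(right)
    return left, right


# Supervised, numeric target: floor area (m2) -> price (thousands)
slope, intercept = fit_line([50, 80, 100], [150, 240, 300])
predicted_price = slope * 90 + intercept

# Supervised, categorical target: exclamation marks per email -> label
label = nearest_centroid({"spam": [8, 9, 10], "not spam": [0, 1, 2]}, 7)

# Unsupervised: no labels, just group similar spend values
segments = two_means([1, 2, 3, 10, 11, 12])
```

Regression and classification both need labelled examples; clustering is handed only the raw values.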
Domain 2: Implement AI Solutions Using Foundry (55–60%)

Foundry (formerly Azure AI Studio) is a unified PaaS ecosystem at ai.azure.com.

Foundry Hub → Project → Deployed Models → Endpoints
Component | Role
Hub | Top-level admin boundary: security, networking, billing
Project | Workspace: models, vector indexes, datasets, agents
Connections | Authenticated links to Azure Storage, AI Search, Speech, etc.
Endpoint | https://{name}.services.ai.azure.com/api/projects/{project}
Model Catalog | Browse & deploy OpenAI, Meta Llama, Phi, Mistral models
Foundry ≠ Azure OpenAI only. Foundry includes ALL Azure AI services: OpenAI models, Speech, Vision, Content Understanding, etc.
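
The Endpoint row's URL pattern can be filled in with a small helper. `project_endpoint` is a hypothetical function, and the resource and project names below are placeholders rather than real deployments.

```python
from urllib.parse import urlsplit


def project_endpoint(resource_name: str, project_name: str) -> str:
    """Fill the https://{name}.services.ai.azure.com/api/projects/{project} pattern."""
    if not resource_name or not project_name:
        raise ValueError("Both the resource name and the project name are required")
    return f"https://{resource_name}.services.ai.azure.com/api/projects/{project_name}"


endpoint = project_endpoint("contoso-ai", "exam-prep")  # placeholder names
parts = urlsplit(endpoint)
```

Here `parts.netloc` identifies the Foundry resource and `parts.path` identifies the project inside it.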

Terminology renames β€” don't get caught

Legacy (wrong on exam) | Current (correct)
Azure AI Studio | Microsoft Foundry
Azure AD | Microsoft Entra ID
Azure Cognitive Services | Azure AI services
Form Recognizer | Content Understanding (via Document Intelligence)
LUIS | Azure AI Language CLU
Assistants API | Responses API (Agents v2)

This table is the highest-value thing to memorise. The exam swaps SDK responsibilities as distractors.

Package | Purpose | Key classes
azure-ai-projects | Foundry orchestration: agents, tools, vector stores, evaluators | AIProjectClient → .agents, .evaluators, .vector_stores
azure-ai-inference | Chat completions, multimodal input (text + image + audio) | ChatCompletionsClient, InputAudio, AudioContentItem, ImageContentItem
azure-cognitiveservices-speech | Dedicated STT and TTS via Azure Speech | SpeechConfig, SpeechRecognizer, SpeechSynthesizer
azure-ai-contentunderstanding | Structured extraction from docs/images (replaces Form Recognizer) | Analyzer config → structured objects
azure-identity (imported as azure.identity) | Authentication across all SDKs | DefaultAzureCredential()
Distractors will use azure-ai-inference for transcription or azure-cognitiveservices-speech for entity extraction. Know which SDK handles which job; they don't cross over.
You build multi-agent systems in Pydantic AI. AI-901 only tests single-agent understanding, which is simpler than what you already do.

Agent definition (testable)

An agent in Foundry = three components:

Model + Instructions + Tools

Model = from catalog (GPT-4o etc). Instructions = system prompt defining goals/behaviour. Tools = FileSearchTool, CodeInterpreter, external APIs.

Distractors: "Endpoint + API key + deployment name" (auth properties, not agent components). "Trigger + action + condition" (Power Automate, not agents). "Embedding + retriever + generator" (RAG pipeline, not agent definition).

Lightweight chat client SDK pattern

    import os

    from azure.ai.inference import ChatCompletionsClient
    from azure.identity import DefaultAzureCredential

    # 1. Auth: read the endpoint from the environment rather than hardcoding it
    #    (AZURE_AI_ENDPOINT is an assumed variable name)
    endpoint = os.environ["AZURE_AI_ENDPOINT"]
    client = ChatCompletionsClient(endpoint, DefaultAzureCredential())

    # 2. Send messages
    response = client.complete(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hello!"},
        ],
    )
    print(response.choices[0].message.content)

Agent with tools (FileSearchTool + RAG)

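    from azure.ai.agents.models import FilePurpose, FileSearchTool

    # `client` is an AIProjectClient here. Method names follow azure-ai-agents 1.x and
    # may differ in other SDK versions; "policy.pdf" is a placeholder document.
    # Upload the doc, then index it in a vector store
    uploaded = client.agents.files.upload_and_poll(file_path="policy.pdf", purpose=FilePurpose.AGENTS)
    vector_store = client.agents.vector_stores.create_and_poll(file_ids=[uploaded.id], name="docs")

    # Bind FileSearchTool to agent (pass tool.definitions and tool.resources at creation)
    tool = FileSearchTool(vector_store_ids=[vector_store.id])
    # Agent uses RAG to search docs autonomously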
The exam explicitly tests: creating agents in the Foundry portal, deploying models from the catalog, testing prompts in the chat playground. You should have done this at least once at ai.azure.com before exam day.

In AI-901, text analysis is done via agents with constrained system prompts, not dedicated text analytics APIs:

AIProjectClient → .agents → System prompt constrains output → JSON entities out

For document-scale analysis, bind a FileSearchTool with a vector store. For structured extraction from scanned PDFs/forms, use Content Understanding (azure-ai-contentunderstanding).
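
The "system prompt constrains output → JSON entities out" flow still needs a defensive parse on the client side. A minimal sketch: the prompt wording, the reply string, and the `parse_entities` helper are all hypothetical, and no model is actually called.

```python
import json
from typing import Dict, List

# Hypothetical system prompt that pins the model to a JSON-only reply.
SYSTEM_PROMPT = (
    "Extract named entities from the user's text. Respond with JSON only, "
    'shaped as {"entities": [{"text": "...", "category": "..."}]}.'
)


def parse_entities(raw_reply: str) -> List[Dict[str, str]]:
    """Validate a model reply against the shape the system prompt asked for."""
    try:
        payload = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError("Model reply was not valid JSON") from exc
    entities = payload.get("entities") if isinstance(payload, dict) else None
    if not isinstance(entities, list):
        raise ValueError('Model reply is missing an "entities" list')
    cleaned = []
    for item in entities:
        # Keep only well-formed items; silently skip anything else.
        if (isinstance(item, dict)
                and isinstance(item.get("text"), str)
                and isinstance(item.get("category"), str)):
            cleaned.append({"text": item["text"], "category": item["category"]})
    return cleaned


# Invented reply standing in for what an agent might return.
reply = ('{"entities": [{"text": "Contoso", "category": "Organization"}, '
         '{"text": "Manchester", "category": "Location"}, {"oops": 1}]}')
entities = parse_entities(reply)
```

This ties back to Reliability & Safety: a prompt makes structured output likely, not guaranteed, so the caller checks before acting on it.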

Legacy pipeline (wrong approach on exam)

STT model → NLP model → TTS model

High latency. Strips emotional prosody: the model only sees sterile text.

Modern multimodal (correct approach)

Raw audio/image bytes → gpt-4o / phi-4-multimodal → Response

Sub-second. Understands tone, sarcasm, urgency directly from waveforms.

If a question asks how to preserve emotional context from a user's voice, the answer is always multimodal (not STT→NLP→TTS). Distractor options will suggest transcribing first — that's the trap.
Do NOT manually Base64-encode audio into a custom JSON dict. The SDK handles serialization. Manual encoding → "Invalid Input" error. This is a heavily tested distractor.

Correct class chain

raw bytes → InputAudio → AudioContentItem → UserMessage → .complete()
    from azure.ai.inference.models import (
        AudioContentFormat,
        AudioContentItem,
        InputAudio,
        TextContentItem,
        UserMessage,
    )

    # Point the SDK at the raw audio file; InputAudio.load reads the bytes and
    # handles the encoding, so there is no hand-built Base64 JSON
    audio = InputAudio.load(audio_file="query.wav", audio_format=AudioContentFormat.WAV)
    audio_item = AudioContentItem(input_audio=audio)

    # Build message with text + audio
    msg = UserMessage(content=[
        TextContentItem(text="Analyze the tone"),
        audio_item,
    ])
    response = client.complete(model="gpt-4o", messages=[msg])

Dedicated Speech services are still tested for enterprise STT/TTS, call centres, and SSML control.

Speech-to-Text

    import os

    from azure.cognitiveservices.speech import ResultReason, SpeechConfig, SpeechRecognizer

    config = SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"],
        region=os.environ["AZURE_SPEECH_REGION"],
    )
    recognizer = SpeechRecognizer(speech_config=config)

    # recognize_once_async() returns a future immediately; .get() waits for the result
    result = recognizer.recognize_once_async().get()

    # MUST check reason before reading text
    if result.reason == ResultReason.RecognizedSpeech:
        print(result.text)
recognize_once_async() is async: the call itself returns a future and doesn't block; only .get() waits for the result. The exam tests whether you know this. Always check result.reason before result.text.
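
The "returns a future, then check the reason" pattern can be sketched without the Speech SDK. Everything here is a stand-in: `Reason`, `Result` and `slow_recognize` are hypothetical names that mimic the shape of the SDK objects, and `concurrent.futures` plays the role of the SDK's future.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):  # stand-in for the SDK's result reasons
    RECOGNIZED = "recognized"
    NO_MATCH = "no_match"


@dataclass
class Result:  # stand-in for a recognition result
    reason: Reason
    text: str


release = threading.Event()


def slow_recognize() -> Result:
    """Pretend recognition that cannot finish until the caller allows it."""
    release.wait(timeout=5)
    return Result(Reason.RECOGNIZED, "hello")


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_recognize)  # returns at once, work continues in background
    was_pending = not future.done()       # the caller is free to do other things here
    release.set()
    result = future.result()              # this is the call that waits

# Check the reason BEFORE trusting the text.
transcript = result.text if result.reason is Reason.RECOGNIZED else None
```

`was_pending` is true because starting the work did not wait for it; only `future.result()` (the analogue of `.get()`) blocks.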

Text-to-Speech + SSML

    from azure.cognitiveservices.speech import SpeechSynthesisOutputFormat, SpeechSynthesizer

    # Set output quality
    config.set_speech_synthesis_output_format(
        SpeechSynthesisOutputFormat.Riff24Khz16BitMonoPcm)
    # Set voice
    config.speech_synthesis_voice_name = "en-GB-George"
    synthesizer = SpeechSynthesizer(speech_config=config)

    # SSML for multi-speaker; the root element needs version, xmlns and xml:lang
    ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
      <voice name="en-US-AriaNeural">Welcome!</voice>
      <voice name="en-GB-George">How can I help?</voice>
    </speak>"""
    synthesizer.speak_ssml_async(ssml).get()
SpeechSynthesisOutputFormat is for TTS audio quality ONLY. Distractors will use it as a "filter" for analytics or sentiment, but it's not an analytics tool.

Interpreting visual input (multimodal)

Same pattern as audio. Use azure-ai-inference to send images to a multimodal model:

Image bytes or URL → ImageContentItem → UserMessage → .complete()

Models: gpt-4o, phi-4-multimodal-instruct. The same models that handle audio also handle images natively.

Generating images

Use generative image models (DALL-E 3 / GPT-image) via Foundry. Provide a text prompt, receive generated image. Common use: marketing visuals, product mockups, creative content.

Azure AI Vision capabilities (conceptual)

Capability | What it does
Image classification | Labels the whole image ("beach", "office")
Object detection | Locates objects with bounding boxes
OCR | Extracts text from images
Face detection | Locates faces, estimates age/emotion (not identify)
Spatial analysis | Analyses movement/position of people in video
Vision questions often test the boundary between "what's in this image" (computer vision) and "extract these specific fields from this document" (information extraction / Content Understanding).

Azure Content Understanding is a Foundry Tool for extracting structured information. It handles:

Documents & Forms: key-value pairs, tables, signatures from PDFs, invoices, contracts
Images: structured fields from photos of receipts, IDs, labels
Audio: transcripts + structured extraction from recordings
Video: scene analysis, transcript extraction, event detection

SDK: azure-ai-contentunderstanding. Define an analyzer config → pass binary data → iterate structured results.

Content Understanding is a centrepiece of AI-901's information extraction section. Run it against at least one PDF and one image before exam day so the SDK pattern is familiar.

Content Understanding vs OCR vs Vision

Task | Tool
"Read the text in this image" | OCR (Azure AI Vision)
"What objects are in this image?" | Computer Vision
"Extract invoice number, date, total from this scanned PDF" | Content Understanding
"Transcribe this audio and extract speaker names and dates mentioned" | Content Understanding
Principle | Text | Speech | Vision
Fairness | Sentiment must not penalise dialects | - | Face detection accuracy across skin tones
Inclusiveness | Support multiple languages | STT trained on diverse accents, impediments | Accessibility for visually impaired users
Transparency | Disclose AI-generated content | Disclose neural voices are not human | Disclose AI-generated images
Privacy | Secure handling of analysed text | Never hardcode speech keys | Face detection ≠ identification
Reliability | Check results before acting | Handle silent audio, auth failures | Handle low-confidence predictions
Accountability | Log analysis for audit | Log synthesised interactions | Human review of high-stakes decisions
Trap | Truth
Temperature 0 = accurate answers | Temperature 0 = deterministic (same input → same output). Model can still hallucinate. For accuracy, use RAG or better prompts
Models remember previous conversations | Models are stateless. Full history must be sent with every API call
Manually Base64-encode audio for multimodal | Use SDK classes (InputAudio → AudioContentItem). SDK handles encoding
SpeechSynthesisOutputFormat extracts sentiment | It's a TTS output quality setting only
GPT-3.5 can ingest raw audio | GPT-3.5 is text-only. Use GPT-4o or phi-4-multimodal for audio
SSML is for text formatting | SSML is for speech synthesis control (pitch, pace, pauses, voices)
OCR = information extraction | OCR reads text. Content Understanding extracts structured fields
Foundry = Azure OpenAI | Foundry includes ALL Azure AI services, not just OpenAI
Behavioural rules go in user prompt | System prompt = developer rules. User prompt = end user questions
Agent = endpoint + API key | Agent = model + instructions + tools
Practice Questions
Q1: A developer sends a WAV file to a multimodal model but gets "Invalid Input". They're manually Base64-encoding audio into a custom JSON dict. What's the fix?
A. Convert to MP3; WAV isn't supported.
B. Read as raw bytes → InputAudio(format=AudioContentFormat.WAV) → AudioContentItem → UserMessage.
C. Upload Base64 string as a persistent dataset via azure-ai-projects, reference by GUID.
D. Use azure-cognitiveservices-speech to transcribe first, then pass text to the model.
Answer: B. The SDK handles serialization. Manual encoding causes schema errors. A is wrong because WAV is supported. C overcomplicates dataset tracking. D defeats the multimodal purpose by stripping prosody.
Q2: A call centre needs to batch-transcribe recorded calls, then analyse transcripts for sentiment and entity extraction. Which SDK sequence?
A. azure-ai-inference for transcription → azure-cognitiveservices-speech for entity extraction.
B. azure-cognitiveservices-speech for STT → azure-ai-projects agent for text analysis.
C. Upload to Blob Storage → apply SpeechSynthesisOutputFormat filter to extract sentiment.
D. azure-ai-inference with GPT-3.5 to ingest raw audio and return SSML-highlighted entities.
Answer: B. Speech SDK handles STT. Projects SDK handles text analysis via agents. A reverses responsibilities. C misuses SpeechSynthesisOutputFormat (it's TTS quality only). D fails because GPT-3.5 isn't multimodal and SSML isn't for text formatting.
Q3: A chatbot needs to book meetings, search knowledge bases, and send emails autonomously. Which workload category?
A. Generative AI: it's generating responses.
B. Agentic AI: it's taking actions autonomously.
C. Text analysis: it's processing language.
D. Information extraction: it's pulling data from sources.
Answer: B. Key distinction: if the AI is taking actions (booking, emailing, searching), it's agentic. If it's only creating content, it's generative.
Q4: A company processes thousands of scanned invoices and needs to extract invoice numbers, dates, and totals into structured JSON. Which Azure service?
A. Azure AI Vision (OCR): it reads text from images.
B. Azure Content Understanding: it extracts structured fields from documents.
C. Azure AI Language (NER): it detects entities in text.
D. GPT-4o with a system prompt to extract fields.
Answer: B. "Scanned PDF" + "named structured fields" = information extraction via Content Understanding. OCR just reads text. NER works on already-extracted text. GPT-4o could work but Content Understanding is the purpose-built tool at scale.
Q5: A developer sets temperature to 0 to ensure their medical Q&A bot gives correct answers. Is this sufficient?
A. Yes: temperature 0 ensures accurate, factual output.
B. No: temperature 0 makes output deterministic but the model can still hallucinate. Use RAG for accuracy.
C. No: temperature 0 is invalid; minimum is 0.1.
D. Yes: combined with max_tokens limit, it guarantees factual responses.
Answer: B. Temperature 0 = same input → same output. It does NOT mean correct. The model can still hallucinate. For accuracy in medical scenarios, use grounding via RAG (Retrieval-Augmented Generation) with verified sources.
Q6: A TTS neural voice sounds so realistic that customers believe they're speaking to a human. Which responsible AI principle is violated?
A. Fairness: not all customers have equal access.
B. Transparency: users must be informed they're interacting with AI.
C. Inclusiveness: the voice doesn't support all languages.
D. Reliability: the voice might fail mid-conversation.
Answer: B. Transparency requires that users know when they're interacting with AI, not a human. Realistic neural voices must include disclosure.
Q7: What are the three components of an agent in Microsoft Foundry Agent Service?
A. Endpoint, API key, deployment name.
B. Embedding, retriever, generator.
C. Model, instructions, tools.
D. Trigger, action, condition.
Answer: C. Agent = model (from catalog) + instructions (system prompt) + tools (FileSearch, CodeInterpreter, APIs). A describes auth properties. B describes a RAG pipeline. D describes Power Automate workflows.
Q8: A team wants to deploy Meta Llama 3 for a chatbot with no GPU budget, starting today, paying only for tokens consumed. Which deployment option?
A. Serverless (pay-per-token) deployment in Foundry.
B. Dedicated endpoint with reserved compute.
C. Download the model and host on Azure VMs.
D. Use Azure Machine Learning managed endpoints.
Answer: A. Serverless = no reserved GPUs, pay only for tokens, fast to deploy. Dedicated endpoint requires reserved compute (budget). Self-hosting on VMs is the opposite of "no GPU budget."
Study Resources
Resource | Link
Official study guide | learn.microsoft.com/.../ai-901
Free training: AI concepts path | learn.microsoft.com/.../ai-concepts
Free training: AI apps & agents | learn.microsoft.com/.../ai-apps-agents
Foundry Portal | ai.azure.com (explore before exam day)
GitHub labs | MicrosoftLearning/mslearn-ai-fundamentals
Tim Warner study repo | timothywarner-org/ai901
Free practice: Tutorials Dojo (20 Qs) | tutorialsdojo.com
Free practice: A Guide to Cloud (265 Qs) | aguidetocloud.com
Free practice: Examinotion (5 worked Qs) | examinotion.com