← Project hub 1 · Reading (concepts) 2 · Study guide (traps)
Read this first · All eight objective areas

AI-901 Reading Companion

Plain-language notes in Microsoft's current terminology · Exam: 5 July 2026 · Pass: 700/1000
0 / 0 read

How to use this page. Read every section top to bottom and tick each one as you finish (your progress saves in this browser). The goal here is not tricks — it is exposure to the exact names Microsoft uses. The exam rewards matching a scenario to the one correct service or capability, so the bold terms and the purple "Say it the Microsoft way" boxes are the things to lock in.

Order of study: 1) this reading → 2) the study guide (traps and quick-fire tables) → 3) hands-on labs → 4) re-test.

Domain 1 Identify AI concepts and capabilities — 40–45%

Microsoft groups responsible AI into six principles. The exam describes a situation and asks which principle it concerns. Learn the six names and the one-line idea behind each.

PrincipleCore ideaScenario signal
FairnessThe system treats all people and groups equitably.A group is advantaged or disadvantaged for a reason unrelated to the task (gender, ethnicity, dialect, which school they attended).
Reliability and safetyThe system behaves consistently and safely, including in unexpected conditions.The model must perform predictably, fail gracefully, and be tested before release. Self-driving, medical, and financial scenarios.
Privacy and securityPersonal data is protected and access is controlled.Keeping customer data confidential, consent, securing keys and access to the model.
InclusivenessThe system empowers everyone and engages people of all abilities.Accessibility, support for many languages, accents, and abilities; not leaving groups out.
TransparencyPeople understand how the system works and when they are dealing with AI.Disclosing that a voice or chat is AI, explaining how a decision was reached, documenting limitations.
AccountabilityPeople remain responsible and answerable for how the AI behaves.Human oversight of high-stakes outputs, audit logging, governance, someone owns the outcome.
There are exactly six principles. If an answer option invents a seventh (for example "efficiency" or "profitability"), it is wrong. "Reliability and safety" and "privacy and security" are each one principle, not four.
Watch the difference between fairness (a group is treated worse for an irrelevant reason) and inclusiveness (a group is left out or cannot access the service at all). A hiring tool scoring one university lower = fairness. A speech app that fails for users with a speech impediment = inclusiveness.
A realistic AI voice that makes customers believe they are talking to a human is a transparency problem (they were not told it is AI), not reliability.

How generative AI models work

A large language model (LLM) is trained on a very large body of text. It breaks text into tokens (word pieces) and represents meaning as numeric vectors called embeddings. At its core the model does one thing: it predicts the next token, one token at a time, based on probability. Modern models use the transformer architecture, which uses attention to weigh which earlier tokens matter most.

  • Models are stateless. They do not remember past calls. The full conversation history is sent again on every request, and that history counts against the token limit.
  • The amount of text a model can consider at once is the context window, measured in tokens (it covers your prompt plus the response).
  • Models can hallucinate (produce confident but wrong content). To improve accuracy you ground the model in trusted data using retrieval-augmented generation (RAG), not by lowering temperature.
  • Fine-tuning further trains a base model on your own examples to specialise its behaviour. Multimodal models accept more than text — images, audio, or both.

Choosing an appropriate model

Pick the model by what the task needs, then by cost, speed, and context size.

You need to…Choose
Chat, write, summarise, reason over textA text/chat model (for example GPT-4o, GPT-4o-mini)
Understand images or audio as inputA multimodal model (for example GPT-4o, Phi multimodal)
Create images from a text promptAn image-generation model (for example DALL-E)
Turn text into vectors for search or RAGAn embedding model (for example text-embedding-3)
Keep cost and latency lowA small model (a "mini" or Phi small language model)
A bigger model is not automatically the right answer. If the scenario stresses low cost, low latency, or high volume of simple requests, the correct pick is a small / mini model.

Deployment options and configuration parameters

In Microsoft Foundry you deploy a model from the model catalog, which gives you an endpoint to call. Two broad deployment styles:

  • Serverless / standard (pay-as-you-go) — no infrastructure to manage, you pay per token. Fast to start, good for variable or low traffic.
  • Managed compute (dedicated) — you reserve compute for predictable, high-volume workloads.

You may also choose a region or data-zone for data residency. After deployment you tune inference parameters:

ParameterWhat it controls
temperatureRandomness. Low (near 0) = focused and deterministic; high = more varied and creative.
top_pNucleus sampling — restricts choices to the most probable tokens. Tune temperature or top_p, not both.
max tokensCaps the length of the response.
frequency / presence penaltyDiscourage repetition / encourage new topics.
stop sequencesText that tells the model to stop generating.
temperature = 0 means deterministic (same input gives the same output). It does not mean "accurate." A model at temperature 0 can still be wrong. For factual accuracy you ground with RAG.

Naming the workload (1.3.1)

The exam gives a scenario and asks which workload it is. Decide by what the system does with the input, not by what the input is.

WorkloadThe system is…Not this
Generative AICreating new content — text, summaries, images, code.If it takes actions on your behalf → agentic.
Agentic AITaking multi-step actions autonomously — booking, emailing, calling tools.If it only writes a reply → generative.
Text analysis (natural language processing)Pulling meaning from text — sentiment, entities, key phrases, language.If the source is a scanned form or image → information extraction.
SpeechConverting between audio and text, or speaking.If analysing the words after transcription → text analysis.
Computer visionInterpreting a scene — classifying an image, detecting objects, reading text (OCR).If extracting named fields from a document → information extraction.
Information extractionPulling structured fields out of documents, images, audio, or video.If only reading raw text from an image → OCR (computer vision).
An input being an image does not make it computer vision. Reading the supplier, date, and total out of a scanned invoice is information extraction, even though the input is a picture.

Text analysis techniques (1.3.2)

Sentiment analysis
Tone of the text: positive, negative, or neutral, with confidence scores.
Key phrase extraction
The main talking points or topics in the text.
Entity detection (NER)
Names of people, places, organisations, dates, quantities.
Summarisation
A shorter version: extractive (picks key sentences) or abstractive (writes new text).

Other capabilities in Azure AI Language: language detection (which language), PII detection (find and redact personal data), entity linking, conversational language understanding (CLU) for intent and entities, and custom question answering for a Q&A knowledge base.

Use Azure AI Language, never Text Analytics or Cognitive Services. CLU replaces LUIS; custom question answering replaces QnA Maker.

Speech (1.3.3)

  • Speech-to-text (speech recognition) — audio in, text out. Real-time or batch.
  • Text-to-speech (speech synthesis) — text in, spoken audio out, using neural voices.
  • SSML (Speech Synthesis Markup Language) — XML markup to control pitch, pace, pauses, and switch voices.
  • Speech translation — spoken word in one language to another.
The service is Azure AI Speech. "Recognition" = speech-to-text. "Synthesis" = text-to-speech.

Computer vision and image generation (1.3.4)

Image classification
What is this a picture of? One label for the whole image.
Object detection
Which objects, and where? Bounding boxes plus labels.
OCR (Read)
Extract printed or handwritten text that appears in an image.
Face detection
Locate faces and estimate attributes (detection, not identification).

Image generation is the reverse: a generative model (for example DALL-E) creates a new image from a text prompt. Recognising versus creating an image is a common exam split.

The service is Azure AI Vision. Training your own image categories is Azure AI Vision custom (this replaced Custom Vision).

Extracting information from text, images, audio, and video (1.3.5)

When the goal is structured fields out of mixed content, the workload is information extraction. Across documents, images, audio, and video this is the job of Azure Content Understanding. Forms and documents specifically can also use Azure AI Document Intelligence.

Azure AI Document Intelligence replaced Form Recognizer. Azure Content Understanding is the newer multimodal service that also handles images, audio, and video.
Domain 2 Implement AI solutions by using Microsoft Foundry — 55–60%

The Foundry mental model

Microsoft Foundry is the unified portal at ai.azure.com for building AI solutions. You work inside a Foundry project, deploy models from the model catalog, experiment in the playground, and call deployed models from code through the Foundry SDK.

Use Microsoft Foundry, never Azure AI Studio or Azure AI Foundry. Foundry includes all Azure AI services, not only OpenAI models. Sign-in identity is Microsoft Entra ID, never Azure AD.

System prompt vs user prompt (2.1.1)

PromptWho writes itPurpose
System prompt (system message)The developerSets persona, rules, tone, constraints, and output format. Applied before any user input.
User promptThe end userThe actual question or task.
Any "always respond formally / only answer about policy / reply in JSON" instruction goes in the system prompt, not the user prompt and not a deployment setting.

Deploy and interact in the portal (2.1.2)

From the model catalog you select a model, deploy it, and test it in the chat playground before writing any code. Deploying gives you an endpoint and a deployment name.

Lightweight chat client with the Foundry SDK (2.1.3)

The build lessons call Foundry from Python using the azure-ai-projects package and authenticate with DefaultAzureCredential from azure.identitykeyless sign-in through Microsoft Entra after az login. The only configuration value you supply is the project endpoint.

Preferred auth is keyless with DefaultAzureCredential, not a hard-coded API key. Never put a key in code.

Single agent in the portal (2.1.4) and its client app (2.1.5)

An agent in Foundry is exactly three parts:

ComponentMeaning
ModelThe deployed model from the catalog that powers the agent.
InstructionsThe system prompt that defines the agent's goal and behaviour.
ToolsCapabilities you attach so the agent can do more than chat.

The tools you can attach are a frequent exam topic:

File Search (grounding)
RAG over documents you upload. "Answer from my PDFs."
Function / tool calling
Call your own code or API to take an action or fetch live data.
Code Interpreter
Run Python in a sandbox for maths, data, and charts.
Bing / web grounding
Ground answers in live public web results.

A client app for an agent uses the azure-ai-agents capabilities of the Foundry SDK, again with DefaultAzureCredential.

"Answer using my uploaded documents" = File Search (RAG), not function calling. Function calling is for taking actions or pulling live external data through your own code.

Text analysis app (2.2.1)

You build a lightweight app that analyses text — for example classifying sentiment, pulling key phrases, or extracting entities — using a deployed model in Foundry or Azure AI Language capabilities. Match the requested task to the right capability (sentiment for tone, key phrases for topics, NER for names).

Respond to spoken prompts with a multimodal model (2.2.2)

A multimodal model (for example GPT-4o) can take audio directly as input and respond, without a separate transcription step. This preserves tone and intent that a text-only pipeline would lose.

If a question asks how to preserve the emotion or tone of a caller's voice, the answer is a multimodal model that ingests the audio — not a speech-to-text → analyse → speak pipeline, which strips prosody.

Azure Speech in Foundry Tools (2.2.3)

For dedicated, enterprise speech — call-centre transcription, custom neural voices, fine SSML control — you use Azure AI Speech through Microsoft Foundry Tools: speech-to-text for transcription and text-to-speech with SSML for synthesis.

Two valid paths to "speech." A multimodal model for conversational, tone-aware responses; Azure AI Speech for dedicated transcription and synthesis with voice and SSML control. Read the scenario to pick.

Interpret visual input with a multimodal model (2.3.1)

Send an image to a multimodal model and ask about it — describe the scene, answer a question about the picture, read a chart. Same idea as audio: the model takes the image as input alongside your text prompt.

Create visual outputs with generative models (2.3.2)

Use an image-generation model (for example DALL-E) to produce a new image from a text prompt — marketing visuals, mockups, illustrations.

Build a vision app (2.3.3)

Combine the two: an app that interprets images with a multimodal model, generates images when needed, or calls Azure AI Vision for classic capabilities such as OCR, image classification, and object detection.

Keep the split clear: interpreting an existing image = multimodal model or Azure AI Vision; creating a new image = image-generation model. Reading text in a photo = OCR; reading named fields from a form = information extraction.

Azure Content Understanding, available through Microsoft Foundry Tools, extracts structured information across four content types. You define an analyzer (the fields or schema you want), run content through it, and get structured results back.

Documents and forms (2.4.1)
Key-value fields, tables, and signatures from PDFs, invoices, and contracts.
Images (2.4.2)
Structured fields from photos of receipts, IDs, and labels.
Audio (2.4.3)
Transcripts plus structured details extracted from recordings.
Video (2.4.3)
Scene analysis, transcripts, and event detection from video.

Objective 2.4.4 ties it together: build a lightweight app whose job is information extraction using Content Understanding.

The scenario asks to…Correct tool
Read the text that appears in an imageOCR (Azure AI Vision)
Say what objects are in an imageComputer vision (Azure AI Vision)
Pull invoice number, date, and total from a scanned PDFAzure Content Understanding
Transcribe a recording and extract the names and dates mentionedContent Understanding
For Domain 2 questions, the named Foundry tool for all four content types (documents, images, audio, video) is Azure Content Understanding — that is the answer the exam wants here. Azure AI Document Intelligence (was Form Recognizer) is the dedicated, standalone forms service you may see referenced as a concept, but in Foundry Tools the objective is Content Understanding.

These are Microsoft's own free hands-on exercises — the authoritative labs for this exam. Do them in order: lab 0 creates the Foundry project that every later lab reuses. Tick each one as you finish (saves in this browser).

Run every lab inside one resource group so cleanup is a single delete at the end. The labs use cheap mini models, so usage stays well within the Azure free-account $200 credit.
DoneLabMaps toTime
0 · Get started with Microsoft Foundry 2.1.230 min
2a · Generative AI and agents (File Search / agents) 2.1.1–2.1.535 min
4a · Speech in Microsoft Foundry 2.2.2, 2.2.325 min
5a · Computer vision and image generation 2.3.1–2.3.330 min
6a · Information extraction (Content Understanding) (vision vs extraction) 2.4.1–2.4.425 min
7 · Foundry IQ (grounding agents in knowledge) 2.1.420 min

A credit or debit card is required for identity verification on the free account, but you are not charged unless you manually upgrade to pay-as-you-go.