AI-901 Reading Companion — Concept Notes

How to use this page. Read every section top to bottom and tick each one as you finish (your progress saves in this browser). The goal here is not tricks — it is exposure to the exact names Microsoft uses. The exam rewards matching a scenario to the one correct service or capability, so the bold terms and the purple "Say it the Microsoft way" boxes are the things to lock in.

Order of study: 1) this reading → 2) the study guide (traps and quick-fire tables) → 3) hands-on labs → 4) re-test.

Domain 1 Identify AI concepts and capabilities — 40–45%

Microsoft groups responsible AI into six principles. The exam describes a situation and asks which principle it concerns. Learn the six names and the one-line idea behind each.

Principle	Core idea	Scenario signal
Fairness	The system treats all people and groups equitably.	A group is advantaged or disadvantaged for a reason unrelated to the task (gender, ethnicity, dialect, which school they attended).
Reliability and safety	The system behaves consistently and safely, including in unexpected conditions.	The model must perform predictably, fail gracefully, and be tested before release. Self-driving, medical, and financial scenarios.
Privacy and security	Personal data is protected and access is controlled.	Keeping customer data confidential, consent, securing keys and access to the model.
Inclusiveness	The system empowers everyone and engages people of all abilities.	Accessibility, support for many languages, accents, and abilities; not leaving groups out.
Transparency	People understand how the system works and when they are dealing with AI.	Disclosing that a voice or chat is AI, explaining how a decision was reached, documenting limitations.
Accountability	People remain responsible and answerable for how the AI behaves.	Human oversight of high-stakes outputs, audit logging, governance, someone owns the outcome.

There are exactly six principles. If an answer option invents a seventh (for example "efficiency" or "profitability"), it is wrong. "Reliability and safety" and "privacy and security" are each one principle, not four.

Watch the difference between fairness (a group is treated worse for an irrelevant reason) and inclusiveness (a group is left out or cannot access the service at all). A hiring tool scoring one university lower = fairness. A speech app that fails for users with a speech impediment = inclusiveness.

A realistic AI voice that makes customers believe they are talking to a human is a transparency problem (they were not told it is AI), not reliability.

How generative AI models work

A large language model (LLM) is trained on a very large body of text. It breaks text into tokens (word pieces) and represents meaning as numeric vectors called embeddings. At its core the model does one thing: it predicts the next token, one token at a time, based on probability. Modern models use the transformer architecture, which uses attention to weigh which earlier tokens matter most.

Models are stateless. They do not remember past calls. The full conversation history is sent again on every request, and that history counts against the token limit.
The amount of text a model can consider at once is the context window, measured in tokens (it covers your prompt plus the response).
Models can hallucinate (produce confident but wrong content). To improve accuracy you ground the model in trusted data using retrieval-augmented generation (RAG), not by lowering temperature.
Fine-tuning further trains a base model on your own examples to specialise its behaviour. Multimodal models accept more than text — images, audio, or both.

Choosing an appropriate model

Pick the model by what the task needs, then by cost, speed, and context size.

You need to…	Choose
Chat, write, summarise, reason over text	A text/chat model (for example GPT-4o, GPT-4o-mini)
Understand images or audio as input	A multimodal model (for example GPT-4o, Phi multimodal)
Create images from a text prompt	An image-generation model (for example DALL-E)
Turn text into vectors for search or RAG	An embedding model (for example text-embedding-3)
Keep cost and latency low	A small model (a "mini" or Phi small language model)

A bigger model is not automatically the right answer. If the scenario stresses low cost, low latency, or high volume of simple requests, the correct pick is a small / mini model.

Deployment options and configuration parameters

In Microsoft Foundry you deploy a model from the model catalog, which gives you an endpoint to call. Two broad deployment styles:

Serverless / standard (pay-as-you-go) — no infrastructure to manage, you pay per token. Fast to start, good for variable or low traffic.
Managed compute (dedicated) — you reserve compute for predictable, high-volume workloads.

You may also choose a region or data-zone for data residency. After deployment you tune inference parameters:

Parameter	What it controls
`temperature`	Randomness. Low (near 0) = focused and deterministic; high = more varied and creative.
`top_p`	Nucleus sampling — restricts choices to the most probable tokens. Tune temperature or top_p, not both.
`max tokens`	Caps the length of the response.
`frequency / presence penalty`	Discourage repetition / encourage new topics.
`stop sequences`	Text that tells the model to stop generating.

temperature = 0 means deterministic (same input gives the same output). It does not mean "accurate." A model at temperature 0 can still be wrong. For factual accuracy you ground with RAG.

Naming the workload (1.3.1)

The exam gives a scenario and asks which workload it is. Decide by what the system does with the input, not by what the input is.

Workload	The system is…	Not this
Generative AI	Creating new content — text, summaries, images, code.	If it takes actions on your behalf → agentic.
Agentic AI	Taking multi-step actions autonomously — booking, emailing, calling tools.	If it only writes a reply → generative.
Text analysis (natural language processing)	Pulling meaning from text — sentiment, entities, key phrases, language.	If the source is a scanned form or image → information extraction.
Speech	Converting between audio and text, or speaking.	If analysing the words after transcription → text analysis.
Computer vision	Interpreting a scene — classifying an image, detecting objects, reading text (OCR).	If extracting named fields from a document → information extraction.
Information extraction	Pulling structured fields out of documents, images, audio, or video.	If only reading raw text from an image → OCR (computer vision).

An input being an image does not make it computer vision. Reading the supplier, date, and total out of a scanned invoice is information extraction, even though the input is a picture.

Text analysis techniques (1.3.2)

Sentiment analysis

Tone of the text: positive, negative, or neutral, with confidence scores.

Key phrase extraction

The main talking points or topics in the text.

Entity detection (NER)

Names of people, places, organisations, dates, quantities.

Summarisation

A shorter version: extractive (picks key sentences) or abstractive (writes new text).

Other capabilities in Azure AI Language: language detection (which language), PII detection (find and redact personal data), entity linking, conversational language understanding (CLU) for intent and entities, and custom question answering for a Q&A knowledge base.

Use Azure AI Language, never Text Analytics or Cognitive Services. CLU replaces LUIS; custom question answering replaces QnA Maker.

Speech (1.3.3)

Speech-to-text (speech recognition) — audio in, text out. Real-time or batch.
Text-to-speech (speech synthesis) — text in, spoken audio out, using neural voices.
SSML (Speech Synthesis Markup Language) — XML markup to control pitch, pace, pauses, and switch voices.
Speech translation — spoken word in one language to another.

The service is Azure AI Speech. "Recognition" = speech-to-text. "Synthesis" = text-to-speech.

Computer vision and image generation (1.3.4)

Image classification

What is this a picture of? One label for the whole image.

Object detection

Which objects, and where? Bounding boxes plus labels.

OCR (Read)

Extract printed or handwritten text that appears in an image.

Face detection

Locate faces and estimate attributes (detection, not identification).

Image generation is the reverse: a generative model (for example DALL-E) creates a new image from a text prompt. Recognising versus creating an image is a common exam split.

The service is Azure AI Vision. Training your own image categories is Azure AI Vision custom (this replaced Custom Vision).

Extracting information from text, images, audio, and video (1.3.5)

When the goal is structured fields out of mixed content, the workload is information extraction. Across documents, images, audio, and video this is the job of Azure Content Understanding. Forms and documents specifically can also use Azure AI Document Intelligence.

Azure AI Document Intelligence replaced Form Recognizer. Azure Content Understanding is the newer multimodal service that also handles images, audio, and video.

Domain 2 Implement AI solutions by using Microsoft Foundry — 55–60%

The Foundry mental model

Microsoft Foundry is the unified portal at ai.azure.com for building AI solutions. You work inside a Foundry project, deploy models from the model catalog, experiment in the playground, and call deployed models from code through the Foundry SDK.

Use Microsoft Foundry, never Azure AI Studio or Azure AI Foundry. Foundry includes all Azure AI services, not only OpenAI models. Sign-in identity is Microsoft Entra ID, never Azure AD.

System prompt vs user prompt (2.1.1)

Prompt	Who writes it	Purpose
System prompt (system message)	The developer	Sets persona, rules, tone, constraints, and output format. Applied before any user input.
User prompt	The end user	The actual question or task.

Any "always respond formally / only answer about policy / reply in JSON" instruction goes in the system prompt, not the user prompt and not a deployment setting.

Deploy and interact in the portal (2.1.2)

From the model catalog you select a model, deploy it, and test it in the chat playground before writing any code. Deploying gives you an endpoint and a deployment name.

Lightweight chat client with the Foundry SDK (2.1.3)

The build lessons call Foundry from Python using the azure-ai-projects package and authenticate with DefaultAzureCredential from azure.identity — keyless sign-in through Microsoft Entra after az login. The only configuration value you supply is the project endpoint.

Preferred auth is keyless with DefaultAzureCredential, not a hard-coded API key. Never put a key in code.

Single agent in the portal (2.1.4) and its client app (2.1.5)

An agent in Foundry is exactly three parts:

Component	Meaning
Model	The deployed model from the catalog that powers the agent.
Instructions	The system prompt that defines the agent's goal and behaviour.
Tools	Capabilities you attach so the agent can do more than chat.

The tools you can attach are a frequent exam topic:

File Search (grounding)

RAG over documents you upload. "Answer from my PDFs."

Function / tool calling

Call your own code or API to take an action or fetch live data.

Code Interpreter

Run Python in a sandbox for maths, data, and charts.

Bing / web grounding

Ground answers in live public web results.

A client app for an agent uses the azure-ai-agents capabilities of the Foundry SDK, again with DefaultAzureCredential.

"Answer using my uploaded documents" = File Search (RAG), not function calling. Function calling is for taking actions or pulling live external data through your own code.

Text analysis app (2.2.1)

You build a lightweight app that analyses text — for example classifying sentiment, pulling key phrases, or extracting entities — using a deployed model in Foundry or Azure AI Language capabilities. Match the requested task to the right capability (sentiment for tone, key phrases for topics, NER for names).

Respond to spoken prompts with a multimodal model (2.2.2)

A multimodal model (for example GPT-4o) can take audio directly as input and respond, without a separate transcription step. This preserves tone and intent that a text-only pipeline would lose.

If a question asks how to preserve the emotion or tone of a caller's voice, the answer is a multimodal model that ingests the audio — not a speech-to-text → analyse → speak pipeline, which strips prosody.

Azure Speech in Foundry Tools (2.2.3)

For dedicated, enterprise speech — call-centre transcription, custom neural voices, fine SSML control — you use Azure AI Speech through Microsoft Foundry Tools: speech-to-text for transcription and text-to-speech with SSML for synthesis.

Two valid paths to "speech." A multimodal model for conversational, tone-aware responses; Azure AI Speech for dedicated transcription and synthesis with voice and SSML control. Read the scenario to pick.

Interpret visual input with a multimodal model (2.3.1)

Send an image to a multimodal model and ask about it — describe the scene, answer a question about the picture, read a chart. Same idea as audio: the model takes the image as input alongside your text prompt.

Create visual outputs with generative models (2.3.2)

Use an image-generation model (for example DALL-E) to produce a new image from a text prompt — marketing visuals, mockups, illustrations.

Build a vision app (2.3.3)

Combine the two: an app that interprets images with a multimodal model, generates images when needed, or calls Azure AI Vision for classic capabilities such as OCR, image classification, and object detection.

Keep the split clear: interpreting an existing image = multimodal model or Azure AI Vision; creating a new image = image-generation model. Reading text in a photo = OCR; reading named fields from a form = information extraction.

Azure Content Understanding, available through Microsoft Foundry Tools, extracts structured information across four content types. You define an analyzer (the fields or schema you want), run content through it, and get structured results back.

Documents and forms (2.4.1)

Key-value fields, tables, and signatures from PDFs, invoices, and contracts.

Images (2.4.2)

Structured fields from photos of receipts, IDs, and labels.

Audio (2.4.3)

Transcripts plus structured details extracted from recordings.

Video (2.4.3)

Scene analysis, transcripts, and event detection from video.

Objective 2.4.4 ties it together: build a lightweight app whose job is information extraction using Content Understanding.

The scenario asks to…	Correct tool
Read the text that appears in an image	OCR (Azure AI Vision)
Say what objects are in an image	Computer vision (Azure AI Vision)
Pull invoice number, date, and total from a scanned PDF	Azure Content Understanding
Transcribe a recording and extract the names and dates mentioned	Content Understanding

For Domain 2 questions, the named Foundry tool for all four content types (documents, images, audio, video) is Azure Content Understanding — that is the answer the exam wants here. Azure AI Document Intelligence (was Form Recognizer) is the dedicated, standalone forms service you may see referenced as a concept, but in Foundry Tools the objective is Content Understanding.

These are Microsoft's own free hands-on exercises — the authoritative labs for this exam. Do them in order: lab 0 creates the Foundry project that every later lab reuses. Tick each one as you finish (saves in this browser).

Run every lab inside one resource group so cleanup is a single delete at the end. The labs use cheap mini models, so usage stays well within the Azure free-account $200 credit.

Done	Lab	Maps to	Time
✓	0 · Get started with Microsoft Foundry	2.1.2	30 min
✓	2a · Generative AI and agents (File Search / agents)	2.1.1–2.1.5	35 min
✓	4a · Speech in Microsoft Foundry	2.2.2, 2.2.3	25 min
✓	5a · Computer vision and image generation	2.3.1–2.3.3	30 min
✓	6a · Information extraction (Content Understanding) (vision vs extraction)	2.4.1–2.4.4	25 min
✓	7 · Foundry IQ (grounding agents in knowledge)	2.1.4	20 min

Create a free Azure account ($200 / 30 days) All labs index Foundry portal · ai.azure.com

A credit or debit card is required for identity verification on the free account, but you are not charged unless you manually upgrade to pay-as-you-go.

Finished reading and labs? Final steps before 5 July:

• Work through the AI-901 Study Guide — the traps, quick-fire tables, and practice questions now make sense because you have the vocabulary.

• Re-take the question drill with your Cert Buddy to confirm the gaps have closed.