How to use this page. Read every section top to bottom and tick each one as you finish (your progress saves in this browser). The goal here is not tricks — it is exposure to the exact names Microsoft uses. The exam rewards matching a scenario to the one correct service or capability, so the bold terms and the purple "Say it the Microsoft way" boxes are the things to lock in.
Order of study: 1) this reading → 2) the study guide (traps and quick-fire tables) → 3) hands-on labs → 4) re-test.
Microsoft groups responsible AI into six principles. The exam describes a situation and asks which principle it concerns. Learn the six names and the one-line idea behind each.
| Principle | Core idea | Scenario signal |
|---|---|---|
| Fairness | The system treats all people and groups equitably. | A group is advantaged or disadvantaged for a reason unrelated to the task (gender, ethnicity, dialect, which school they attended). |
| Reliability and safety | The system behaves consistently and safely, including in unexpected conditions. | The model must perform predictably, fail gracefully, and be tested before release. Self-driving, medical, and financial scenarios. |
| Privacy and security | Personal data is protected and access is controlled. | Keeping customer data confidential, consent, securing keys and access to the model. |
| Inclusiveness | The system empowers everyone and engages people of all abilities. | Accessibility, support for many languages, accents, and abilities; not leaving groups out. |
| Transparency | People understand how the system works and when they are dealing with AI. | Disclosing that a voice or chat is AI, explaining how a decision was reached, documenting limitations. |
| Accountability | People remain responsible and answerable for how the AI behaves. | Human oversight of high-stakes outputs, audit logging, governance, someone owns the outcome. |
How generative AI models work
A large language model (LLM) is trained on a very large body of text. It breaks text into tokens (word pieces) and represents meaning as numeric vectors called embeddings. At its core the model does one thing: it predicts the next token, one token at a time, based on probability. Modern models use the transformer architecture, which uses attention to weigh which earlier tokens matter most.
- Models are stateless. They do not remember past calls. The full conversation history is sent again on every request, and that history counts against the token limit.
- The amount of text a model can consider at once is the context window, measured in tokens (it covers your prompt plus the response).
- Models can hallucinate (produce confident but wrong content). To improve accuracy you ground the model in trusted data using retrieval-augmented generation (RAG), not by lowering temperature.
- Fine-tuning further trains a base model on your own examples to specialise its behaviour. Multimodal models accept more than text — images, audio, or both.
Choosing an appropriate model
Pick the model by what the task needs, then by cost, speed, and context size.
| You need to… | Choose |
|---|---|
| Chat, write, summarise, reason over text | A text/chat model (for example GPT-4o, GPT-4o-mini) |
| Understand images or audio as input | A multimodal model (for example GPT-4o, Phi multimodal) |
| Create images from a text prompt | An image-generation model (for example DALL-E) |
| Turn text into vectors for search or RAG | An embedding model (for example text-embedding-3) |
| Keep cost and latency low | A small model (a "mini" or Phi small language model) |
Deployment options and configuration parameters
In Microsoft Foundry you deploy a model from the model catalog, which gives you an endpoint to call. Two broad deployment styles:
- Serverless / standard (pay-as-you-go) — no infrastructure to manage, you pay per token. Fast to start, good for variable or low traffic.
- Managed compute (dedicated) — you reserve compute for predictable, high-volume workloads.
You may also choose a region or data-zone for data residency. After deployment you tune inference parameters:
| Parameter | What it controls |
|---|---|
temperature | Randomness. Low (near 0) = focused and deterministic; high = more varied and creative. |
top_p | Nucleus sampling — restricts choices to the most probable tokens. Tune temperature or top_p, not both. |
max tokens | Caps the length of the response. |
frequency / presence penalty | Discourage repetition / encourage new topics. |
stop sequences | Text that tells the model to stop generating. |
temperature = 0 means deterministic (same input gives the same output). It does not mean "accurate." A model at temperature 0 can still be wrong. For factual accuracy you ground with RAG.Naming the workload (1.3.1)
The exam gives a scenario and asks which workload it is. Decide by what the system does with the input, not by what the input is.
| Workload | The system is… | Not this |
|---|---|---|
| Generative AI | Creating new content — text, summaries, images, code. | If it takes actions on your behalf → agentic. |
| Agentic AI | Taking multi-step actions autonomously — booking, emailing, calling tools. | If it only writes a reply → generative. |
| Text analysis (natural language processing) | Pulling meaning from text — sentiment, entities, key phrases, language. | If the source is a scanned form or image → information extraction. |
| Speech | Converting between audio and text, or speaking. | If analysing the words after transcription → text analysis. |
| Computer vision | Interpreting a scene — classifying an image, detecting objects, reading text (OCR). | If extracting named fields from a document → information extraction. |
| Information extraction | Pulling structured fields out of documents, images, audio, or video. | If only reading raw text from an image → OCR (computer vision). |
Text analysis techniques (1.3.2)
Other capabilities in Azure AI Language: language detection (which language), PII detection (find and redact personal data), entity linking, conversational language understanding (CLU) for intent and entities, and custom question answering for a Q&A knowledge base.
Speech (1.3.3)
- Speech-to-text (speech recognition) — audio in, text out. Real-time or batch.
- Text-to-speech (speech synthesis) — text in, spoken audio out, using neural voices.
- SSML (Speech Synthesis Markup Language) — XML markup to control pitch, pace, pauses, and switch voices.
- Speech translation — spoken word in one language to another.
Computer vision and image generation (1.3.4)
Image generation is the reverse: a generative model (for example DALL-E) creates a new image from a text prompt. Recognising versus creating an image is a common exam split.
Extracting information from text, images, audio, and video (1.3.5)
When the goal is structured fields out of mixed content, the workload is information extraction. Across documents, images, audio, and video this is the job of Azure Content Understanding. Forms and documents specifically can also use Azure AI Document Intelligence.
The Foundry mental model
Microsoft Foundry is the unified portal at ai.azure.com for building AI solutions. You work inside a Foundry project, deploy models from the model catalog, experiment in the playground, and call deployed models from code through the Foundry SDK.
System prompt vs user prompt (2.1.1)
| Prompt | Who writes it | Purpose |
|---|---|---|
| System prompt (system message) | The developer | Sets persona, rules, tone, constraints, and output format. Applied before any user input. |
| User prompt | The end user | The actual question or task. |
Deploy and interact in the portal (2.1.2)
From the model catalog you select a model, deploy it, and test it in the chat playground before writing any code. Deploying gives you an endpoint and a deployment name.
Lightweight chat client with the Foundry SDK (2.1.3)
The build lessons call Foundry from Python using the azure-ai-projects package and authenticate with DefaultAzureCredential from azure.identity — keyless sign-in through Microsoft Entra after az login. The only configuration value you supply is the project endpoint.
DefaultAzureCredential, not a hard-coded API key. Never put a key in code.Single agent in the portal (2.1.4) and its client app (2.1.5)
An agent in Foundry is exactly three parts:
| Component | Meaning |
|---|---|
| Model | The deployed model from the catalog that powers the agent. |
| Instructions | The system prompt that defines the agent's goal and behaviour. |
| Tools | Capabilities you attach so the agent can do more than chat. |
The tools you can attach are a frequent exam topic:
A client app for an agent uses the azure-ai-agents capabilities of the Foundry SDK, again with DefaultAzureCredential.
Text analysis app (2.2.1)
You build a lightweight app that analyses text — for example classifying sentiment, pulling key phrases, or extracting entities — using a deployed model in Foundry or Azure AI Language capabilities. Match the requested task to the right capability (sentiment for tone, key phrases for topics, NER for names).
Respond to spoken prompts with a multimodal model (2.2.2)
A multimodal model (for example GPT-4o) can take audio directly as input and respond, without a separate transcription step. This preserves tone and intent that a text-only pipeline would lose.
Azure Speech in Foundry Tools (2.2.3)
For dedicated, enterprise speech — call-centre transcription, custom neural voices, fine SSML control — you use Azure AI Speech through Microsoft Foundry Tools: speech-to-text for transcription and text-to-speech with SSML for synthesis.
Interpret visual input with a multimodal model (2.3.1)
Send an image to a multimodal model and ask about it — describe the scene, answer a question about the picture, read a chart. Same idea as audio: the model takes the image as input alongside your text prompt.
Create visual outputs with generative models (2.3.2)
Use an image-generation model (for example DALL-E) to produce a new image from a text prompt — marketing visuals, mockups, illustrations.
Build a vision app (2.3.3)
Combine the two: an app that interprets images with a multimodal model, generates images when needed, or calls Azure AI Vision for classic capabilities such as OCR, image classification, and object detection.
Azure Content Understanding, available through Microsoft Foundry Tools, extracts structured information across four content types. You define an analyzer (the fields or schema you want), run content through it, and get structured results back.
Objective 2.4.4 ties it together: build a lightweight app whose job is information extraction using Content Understanding.
| The scenario asks to… | Correct tool |
|---|---|
| Read the text that appears in an image | OCR (Azure AI Vision) |
| Say what objects are in an image | Computer vision (Azure AI Vision) |
| Pull invoice number, date, and total from a scanned PDF | Azure Content Understanding |
| Transcribe a recording and extract the names and dates mentioned | Content Understanding |
These are Microsoft's own free hands-on exercises — the authoritative labs for this exam. Do them in order: lab 0 creates the Foundry project that every later lab reuses. Tick each one as you finish (saves in this browser).
| Done | Lab | Maps to | Time |
|---|---|---|---|
| ✓ | 0 · Get started with Microsoft Foundry | 2.1.2 | 30 min |
| ✓ | 2a · Generative AI and agents (File Search / agents) | 2.1.1–2.1.5 | 35 min |
| ✓ | 4a · Speech in Microsoft Foundry | 2.2.2, 2.2.3 | 25 min |
| ✓ | 5a · Computer vision and image generation | 2.3.1–2.3.3 | 30 min |
| ✓ | 6a · Information extraction (Content Understanding) (vision vs extraction) | 2.4.1–2.4.4 | 25 min |
| ✓ | 7 · Foundry IQ (grounding agents in knowledge) | 2.1.4 | 20 min |
A credit or debit card is required for identity verification on the free account, but you are not charged unless you manually upgrade to pay-as-you-go.
Finished reading and labs? Final steps before 5 July:
• Work through the AI-901 Study Guide — the traps, quick-fire tables, and practice questions now make sense because you have the vocabulary.
• Re-take the question drill with your Cert Buddy to confirm the gaps have closed.