| Principle | What it means | Azure example |
|---|---|---|
| Fairness | AI should treat all people equitably | Sentiment analysis must not penalise dialects; bias testing before deployment |
| Reliability & Safety | AI should perform reliably and safely | Check result.reason before parsing; handle failures gracefully |
| Privacy & Security | AI should be secure and respect privacy | Never hardcode API keys; use DefaultAzureCredential or env vars |
| Inclusiveness | AI should empower everyone | STT trained on diverse accents, speech impediments, background noise |
| Transparency | AI should be understandable | Disclose when users are talking to AI (neural voice), not a human |
| Accountability | People should be accountable for AI | Log AI decisions for audit; human oversight of high-stakes outputs |
How generative AI models work
LLMs predict the next token based on probability distributions. They're trained on massive text corpora, then fine-tuned. They're stateless β full conversation history must be sent with every API call.
Key configuration parameters
| Parameter | What it controls | Exam gotcha |
|---|---|---|
temperature | Randomness of output. 0 = deterministic, 1+ = creative | Temperature 0 β accurate. It means same input β same output, NOT that the output is correct |
top_p | Nucleus sampling β only consider tokens whose cumulative probability β€ top_p | Lower top_p = more focused. Usually adjust one of temperature/top_p, not both |
max_tokens | Maximum length of generated response | Shared with input context. "Context length exceeded" = too many tokens total |
frequency_penalty | Penalises tokens proportionally to how often they've appeared | Reduces repetition |
presence_penalty | Penalises tokens for appearing at all (flat penalty) | Encourages new topics |
stop | Sequences that halt generation | Useful for structured output (lists, dialogs) |
Deployment options in Foundry
Serverless (pay-per-token)
No reserved GPUs. Pay only for tokens consumed. Fast to start. Best for variable/low traffic.
Dedicated endpoint
Reserved compute. Predictable latency. Better for high-volume, production workloads.
System prompt vs User prompt
| Prompt type | Who sets it | Purpose |
|---|---|---|
| System prompt | Developer | Defines persona, rules, constraints, output format. Applied globally before any user input |
| User prompt | End user | The actual question or task |
Choosing an appropriate model
| Need | Model type |
|---|---|
| Text generation, chat, reasoning | GPT-4o, GPT-4o-mini, Meta Llama |
| Multimodal (text + image + audio) | GPT-4o, phi-4-multimodal-instruct |
| Image generation | DALL-E 3, GPT-image models |
| Embeddings (for RAG/search) | text-embedding-ada-002, text-embedding-3-small |
| Lightweight/fast/cheap | GPT-4o-mini, Phi models |
The exam gives you a scenario and asks: which AI workload is this? Know these categories:
| Workload | Scenario signal | Not this |
|---|---|---|
| Generative AI | Creating content β writing, summarising, generating images | If it's taking actions β agentic |
| Agentic AI | Taking actions β booking, emailing, searching, creating tickets, multi-step reasoning | If it's only creating content β generative |
| Text analysis | Extracting structured info from text β sentiment, entities, keywords | If extracting from images/PDFs β information extraction |
| Speech | Converting audioβtext, voice interaction | If analysing what was said β text analysis |
| Computer vision | "What is in this image?" β classification, detection, OCR | If extracting structured fields from a doc β information extraction |
| Information extraction | Extracting structured fields from documents, forms, images, audio, video | If just reading text in an image β OCR/vision |
Text analysis techniques
Speech capabilities
Speech-to-Text (STT)
Audio β text. Real-time or batch. Speaker diarization: identifies who said what.
Text-to-Speech (TTS)
Text β audio. Neural voices. SSML: XML markup for pitch, pace, pauses, multi-speaker.
Computer vision capabilities
Information extraction techniques
Azure Content Understanding extracts structured data from: documents, forms, images, audio, and video. It replaces legacy Form Recognizer / Document Intelligence.
ML fundamentals (light coverage)
| Type | What it does | Example |
|---|---|---|
| Supervised | Learns from labelled data | Classification (spam/not spam), Regression (predict price) |
| Unsupervised | Finds patterns in unlabelled data | Clustering (customer segments) |
| Classification | Predicts a category | Is this email spam? |
| Regression | Predicts a number | What will the house sell for? |
| Clustering | Groups similar items | Customer segmentation |
Foundry (formerly Azure AI Studio) is a unified PaaS ecosystem at ai.azure.com.
| Component | Role |
|---|---|
| Hub | Top-level admin boundary β security, networking, billing |
| Project | Workspace β models, vector indexes, datasets, agents |
| Connections | Authenticated links to Azure Storage, AI Search, Speech, etc. |
| Endpoint | https://{name}.services.ai.azure.com/api/projects/{project} |
| Model Catalog | Browse & deploy OpenAI, Meta Llama, Phi, Mistral models |
Terminology renames β don't get caught
| Legacy (wrong on exam) | Current (correct) |
|---|---|
| Azure AI Studio | Microsoft Foundry |
| Azure AD | Microsoft Entra ID |
| Azure Cognitive Services | Azure AI services |
| Form Recognizer | Content Understanding (via Document Intelligence) |
| LUIS | Azure AI Language CLU |
| Assistants API | Responses API (Agents v2) |
This table is the highest-value thing to memorise. The exam swaps SDK responsibilities as distractors.
| Package | Purpose | Key classes |
|---|---|---|
azure-ai-projects | Foundry orchestration β agents, tools, vector stores, evaluators | AIProjectClient β .agents, .evaluators, .vector_stores |
azure-ai-inference | Chat completions, multimodal input (text + image + audio) | ChatCompletionsClient, InputAudio, AudioContentItem, ImageContentItem |
azure-cognitiveservices-speech | Dedicated STT and TTS via Azure Speech | SpeechConfig, SpeechRecognizer, SpeechSynthesizer |
azure-ai-contentunderstanding | Structured extraction from docs/images (replaces Form Recognizer) | Analyzer config β structured objects |
azure.identity | Authentication across all SDKs | DefaultAzureCredential() |
azure-ai-inference for transcription or azure-cognitiveservices-speech for entity extraction. Know which SDK handles which job β they don't cross over.Agent definition (testable)
An agent in Foundry = three components:
Model = from catalog (GPT-4o etc). Instructions = system prompt defining goals/behaviour. Tools = FileSearchTool, CodeInterpreter, external APIs.
Lightweight chat client SDK pattern
Agent with tools (FileSearchTool + RAG)
ai.azure.com before exam day.In AI-901, text analysis is done via agents with constrained system prompts, not dedicated text analytics APIs:
AIProjectClient.agentsFor document-scale analysis, bind a FileSearchTool with a vector store. For structured extraction from scanned PDFs/forms, use Content Understanding (azure-ai-contentunderstanding).
Legacy pipeline (wrong approach on exam)
High latency. Strips emotional prosody β model only sees sterile text.
Modern multimodal (correct approach)
Sub-second. Understands tone, sarcasm, urgency directly from waveforms.
"Invalid Input" error. This is a heavily tested distractor.Correct class chain
InputAudioAudioContentItemUserMessage.complete()Dedicated Speech services are still tested for enterprise STT/TTS, call centres, and SSML control.
Speech-to-Text
recognize_once_async() is async. Exam tests whether you know it doesn't block. Always check result.reason before result.text.Text-to-Speech + SSML
SpeechSynthesisOutputFormat is for TTS audio quality ONLY. Distractors will use it as a "filter" for analytics or sentiment β it's not an analytics tool.Interpreting visual input (multimodal)
Same pattern as audio β use azure-ai-inference to send images to a multimodal model:
ImageContentItemUserMessage.complete()Models: gpt-4o, phi-4-multimodal-instruct β same models that handle audio also handle images natively.
Generating images
Use generative image models (DALL-E 3 / GPT-image) via Foundry. Provide a text prompt, receive generated image. Common use: marketing visuals, product mockups, creative content.
Azure AI Vision capabilities (conceptual)
| Capability | What it does |
|---|---|
| Image classification | Labels the whole image ("beach", "office") |
| Object detection | Locates objects with bounding boxes |
| OCR | Extracts text from images |
| Face detection | Locates faces, estimates age/emotion (not identify) |
| Spatial analysis | Analyses movement/position of people in video |
Azure Content Understanding is a Foundry Tool for extracting structured information. It handles:
SDK: azure-ai-contentunderstanding. Define an analyzer config β pass binary data β iterate structured results.
Content Understanding vs OCR vs Vision
| Task | Tool |
|---|---|
| "Read the text in this image" | OCR (Azure AI Vision) |
| "What objects are in this image?" | Computer Vision |
| "Extract invoice number, date, total from this scanned PDF" | Content Understanding |
| "Transcribe this audio and extract speaker names and dates mentioned" | Content Understanding |
| Principle | Text | Speech | Vision |
|---|---|---|---|
| Fairness | Sentiment must not penalise dialects | β | Face detection accuracy across skin tones |
| Inclusiveness | Support multiple languages | STT trained on diverse accents, impediments | Accessibility for visually impaired users |
| Transparency | Disclose AI-generated content | Disclose neural voices are not human | Disclose AI-generated images |
| Privacy | Secure handling of analysed text | Never hardcode speech keys | Face detection β identification |
| Reliability | Check results before acting | Handle silent audio, auth failures | Handle low-confidence predictions |
| Accountability | Log analysis for audit | Log synthesised interactions | Human review of high-stakes decisions |
| Trap | Truth |
|---|---|
| Temperature 0 = accurate answers | Temperature 0 = deterministic (same input β same output). Model can still hallucinate. For accuracy, use RAG or better prompts |
| Models remember previous conversations | Models are stateless. Full history must be sent with every API call |
| Manually Base64-encode audio for multimodal | Use SDK classes (InputAudio β AudioContentItem). SDK handles encoding |
| SpeechSynthesisOutputFormat extracts sentiment | It's a TTS output quality setting only |
| GPT-3.5 can ingest raw audio | GPT-3.5 is text-only. Use GPT-4o or phi-4-multimodal for audio |
| SSML is for text formatting | SSML is for speech synthesis control (pitch, pace, pauses, voices) |
| OCR = information extraction | OCR reads text. Content Understanding extracts structured fields |
| Foundry = Azure OpenAI | Foundry includes ALL Azure AI services, not just OpenAI |
| Behavioural rules go in user prompt | System prompt = developer rules. User prompt = end user questions |
| Agent = endpoint + API key | Agent = model + instructions + tools |
"Invalid Input". They're manually Base64-encoding audio into a custom JSON dict. What's the fix?InputAudio(format=AudioContentFormat.WAV) β AudioContentItem β UserMessage.azure-ai-projects, reference by GUID.azure-cognitiveservices-speech to transcribe first, then pass text to the model.azure-ai-inference for transcription β azure-cognitiveservices-speech for entity extraction.azure-cognitiveservices-speech for STT β azure-ai-projects agent for text analysis.SpeechSynthesisOutputFormat filter to extract sentiment.azure-ai-inference with GPT-3.5 to ingest raw audio and return SSML-highlighted entities.| Resource | Link |
|---|---|
| Official study guide | learn.microsoft.com/.../ai-901 |
| Free training: AI concepts path | learn.microsoft.com/.../ai-concepts |
| Free training: AI apps & agents | learn.microsoft.com/.../ai-apps-agents |
| Foundry Portal | ai.azure.com β explore before exam day |
| GitHub labs | MicrosoftLearning/mslearn-ai-fundamentals |
| Tim Warner study repo | timothywarner-org/ai901 |
| Free practice: Tutorials Dojo (20 Qs) | tutorialsdojo.com |
| Free practice: A Guide to Cloud (265 Qs) | aguidetocloud.com |
| Free practice: Examinotion (5 worked Qs) | examinotion.com |