Build a Self-Healing Azure VM: Alerts + Automation Runbooks Step-by-Step

I. Introduction: Why Use Alerts and Runbooks?

When managing infrastructure in the cloud, proactive monitoring is essential. Azure Monitor allows you to detect issues using telemetry data, while Azure Automation Runbooks help you respond automatically when problems occur.

An Alert Rule defines the conditions under which Azure should raise an alert. When triggered, it can notify stakeholders, create tickets, or even execute automated scripts.

Key Components of an Alert Rule:

Component | Description
Target Resource | The specific Azure service being monitored (VMs, Web Apps, Storage Accounts, etc.)
Signal Type | Metrics, Activity Logs, Application Insights logs, or custom logs emitted by the resource
Criteria | Logical condition applied to the signal (e.g., CPU > 80%, VM stopped)
Severity | Criticality of the alert (0 to 4, with 0 being most urgent)
Action | What happens when the alert fires (e.g., email, webhook, automation runbook)

II. Create an Alert Rule to Monitor VM Events

Let’s walk through a scenario where we want to monitor when a Virtual Machine is stopped (deallocated) and receive notifications.

1. Create a Virtual Machine

    2. Navigate to the “Alerts” section and click “New alert rule”

    3. Scope:

    Select the subscription and target VM

    4. Condition:

    • Choose Activity Log
    • Select the signal: Deallocate Virtual Machine (Microsoft.Compute/virtualMachines)

    5. Alert logic:

    When setting up an Activity Log Alert to monitor events like stopping a Virtual Machine, Azure provides configurable logic filters under the Alert logic section. These help you fine-tune exactly when and why an alert should be triggered.

    Below is a breakdown of the three dropdowns in that section:

    • Event Level

    Default value: All selected
    Purpose: Filters the alert based on the severity of the event logged.

    Level | Description
    Critical | Indicates serious issues like system failures.
    Error | Represents failed operations or system errors.
    Warning | A non-critical warning that might require attention.
    Informational | General informational events, e.g., VM start/stop actions.
    Verbose | Highly detailed events, mostly used for debugging.

    ✅ When monitoring for Stop VM events, it’s recommended to leave this at “All selected”, since VM deallocation is usually classified as Informational.

    • Status

    Default value: All selected
    Purpose: Filters events by execution status.

    Status | Meaning
    Started | Operation has begun but is not completed.
    Succeeded | Operation completed successfully.
    Failed | Operation failed to complete.
    All selected | Includes all outcomes.

    ✅ To ensure alerts only trigger when a VM actually stops, it’s advisable to change this to “Succeeded”—which prevents false positives if the stop action fails.

    • Event Initiated By

    Default value: * (All services and users)
    Purpose: Defines who or what initiated the event.

    Option | Meaning
    User (e.g., email) | Action performed by a specific user account
    Azure service (e.g., Microsoft.Compute) | Action triggered by an Azure service or automation
    * (All services and users) | Default; includes both users and system services

    ✅ For general-purpose alerts, it’s safe to keep the default *, which covers all initiators—whether from a user, service principal, or automation.

    6. Action Group:

    • Click “Add action group”
    • Set notification type (Email/SMS/Push/Voice)

    7. Review + Create the alert rule

    8. Test by manually stopping the VM:

    Once your alert rule is created and the action group (with email notification) is in place, it’s time to test the setup:

      1. Navigate to your Virtual Machine in the Azure portal.
      2. Click “Stop” to deallocate the VM.
      3. Wait for a few moments (typically under 5 minutes) for the event to be captured by Azure Monitor.
      4. If the event matches the alert logic (e.g., Deallocate Virtual Machine with Status: Succeeded), the alert will be fired and the action group will be executed.
      5. If you configured the action group to send an email, you will receive a notification like this:

      Subject:
      Azure Monitor alert ‘Monitoring Status VM’ was activated for ‘k8s-master’ at July 26, 2025 4:23 UTC
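The same test can be driven from the command line. The following is a minimal sketch, assuming the Azure CLI is installed and signed in; the resource group name `rg-demo` is a placeholder that does not appear in this article, and the VM name is taken from the sample email subject. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Placeholder names: replace them with your own resource group and VM.
RG="${RG:-rg-demo}"       # assumption: resource group name is not given in the article
VM="${VM:-k8s-master}"    # VM name from the sample email subject

# Print each command by default; run it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop and deallocate the VM, which is the operation the alert rule watches for.
run az vm deallocate --resource-group "$RG" --name "$VM"

# List the last hour of activity-log entries for the resource group
# to check that the deallocate event was recorded.
run az monitor activity-log list --resource-group "$RG" --offset 1h
```

Keeping the dry run as the default makes it harder to stop a VM by accident while adapting the names.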

III. Use an Alert to Trigger an Azure Automation Runbook

🤖 Scenario: Automatically Restart a VM When CPU Usage Exceeds 80%

Instead of manually monitoring CPU spikes and restarting VMs, we can automate this response using Azure Monitor Alerts and Automation Runbooks. Let’s build a real-world scenario where the system monitors CPU usage, and if it goes over 80% for 10 minutes, Azure will trigger a Runbook to restart the VM automatically.

✅ Step 1: Create the Azure Automation Runbook

        1. In the Azure portal, open Automation Accounts and create an Automation Account (or open an existing one)

        2. Navigate to Process Automation > Runbooks

        3. Click + Create a runbook

        • Name: RestartHighCpuVM
        • Type: PowerShell
        • Description: Restart VM when CPU exceeds threshold

        4. Paste the following PowerShell script:

        param (
            [Parameter(Mandatory = $false)]
            [object] $WebhookData
        )

        # Step 1: Make sure the runbook was started by a webhook
        if ($null -eq $WebhookData) {
            Write-Error "WebhookData is null. Are you running this manually?"
            return
        }

        try {
            Write-Output "✅ WebhookData received."

            # Step 2: Parse the JSON body. Preview at most 200 characters,
            # because Substring throws when the body is shorter than the requested length.
            $jsonString = $WebhookData.RequestBody
            $previewLength = [Math]::Min(200, $jsonString.Length)
            Write-Output "Preview JSON: $($jsonString.Substring(0, $previewLength))..."

            $body = ConvertFrom-Json -InputObject $jsonString
            if ($null -eq $body.data -or $null -eq $body.data.context) {
                Write-Error "Webhook body does not contain a 'data.context' section."
                return
            }

            # Step 3: Extract the target VM from the alert context
            # (data.context is populated when the common alert schema is not enabled)
            $context = $body.data.context
            $subscriptionId = $context.subscriptionId
            $resourceGroupName = $context.resourceGroupName
            $vmName = $context.resourceName

            Write-Output "Parsed Subscription: $subscriptionId"
            Write-Output "Parsed Resource Group: $resourceGroupName"
            Write-Output "Parsed VM Name: $vmName"

            # Step 4: Authenticate, select the subscription, and restart the VM.
            # -ErrorAction Stop turns cmdlet errors into exceptions so each catch block runs.
            try {
                Connect-AzAccount -Identity -ErrorAction Stop | Out-Null
                Write-Output "✅ Connected to Azure with Managed Identity."
            }
            catch {
                Write-Error "❌ Failed to authenticate with Managed Identity: $_"
                return
            }

            try {
                Set-AzContext -SubscriptionId $subscriptionId -ErrorAction Stop | Out-Null
                Write-Output "✅ Subscription context set: $subscriptionId"
            }
            catch {
                Write-Error "❌ Failed to set subscription context to ${subscriptionId}: $_"
                return
            }

            try {
                Restart-AzVM -Name $vmName -ResourceGroupName $resourceGroupName -ErrorAction Stop
                Write-Output "✅ VM restart command issued successfully for: $vmName in RG: $resourceGroupName"
            }
            catch {
                Write-Error "❌ Failed to restart VM '${vmName}' in resource group '${resourceGroupName}': $_"
                return
            }
        }
        catch {
            Write-Error "❌ Exception during Runbook execution: $_"
        }

        5. Save and Publish the runbook

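        Ensure the Automation Account has a managed identity enabled (the script signs in with Connect-AzAccount -Identity) and that this identity has Contributor rights on the resource group that contains the VM.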
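The role assignment can also be scripted. This is a minimal sketch, assuming the Azure CLI; the subscription ID, resource group name, and principal ID below are placeholders, not values from this article. The script only prints the command so you can review it before running it yourself.

```shell
#!/bin/sh
# Build the ARM scope string for a resource group.
# Usage: rg_scope SUBSCRIPTION_ID RESOURCE_GROUP
rg_scope() {
    printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
}

# Placeholder values (assumptions): substitute your own.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
RG="rg-demo"
PRINCIPAL_ID="11111111-1111-1111-1111-111111111111"  # object ID of the Automation Account's managed identity
ROLE="Contributor"  # the narrower built-in role "Virtual Machine Contributor" may be enough for a restart

SCOPE="$(rg_scope "$SUBSCRIPTION_ID" "$RG")"

# Print the command for review instead of executing it.
echo "az role assignment create --assignee-object-id $PRINCIPAL_ID --assignee-principal-type ServicePrincipal --role \"$ROLE\" --scope $SCOPE"
```

Scoping the assignment to the resource group rather than the subscription keeps the runbook's permissions limited to the VMs it is meant to restart.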

✅ Step 2: Create the Alert Rule for CPU > 80%

          1. Navigate to the target Virtual Machine

          2. Go to Alerts > + New Alert Rule

          3. Scope: Select the VM

          4. Condition:

          • Signal: Percentage CPU
          • Aggregation: Average
          • Operator: Greater than
          • Threshold: 80
          • Check every: 1 minute

          5. Action Group:

          • Click “Add action group”
          • Choose Automation Runbook as the action type
          • Select your automation account and the runbook RestartHighCpuVM
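          • Make sure to enable the webhook and keep the default parameters, since the payload carries the resource information. The runbook above reads the data.context section of that payload, so leave the common alert schema option disabled; with the common schema enabled, the payload has a different layout.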

          6. Review the alert:

          • Name: HighCPU_RestartVM
          • Severity: 2 - Warning
          • Description: Trigger runbook to restart VM when CPU > 80%

          7. Click Create
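The condition in step 4 compares an average over a time window with a threshold. The sketch below is a simplified model of that comparison, not Azure Monitor's actual evaluation engine; the function name `should_fire` and the sample readings are made up for illustration. It shows why a short spike may not fire the alert while sustained load does.

```shell
#!/bin/sh
# Simplified model of a static-threshold metric rule:
# succeed when the average of the samples is greater than the threshold.
# Usage: should_fire THRESHOLD SAMPLE [SAMPLE...]
should_fire() {
    threshold="$1"
    shift
    printf '%s\n' "$@" | awk -v t="$threshold" '
        { sum += $1; n++ }
        END { exit !(n > 0 && sum / n > t) }'
}

# Hypothetical per-minute "Percentage CPU" readings.
if should_fire 80 95 97 99 40 96; then
    echo "sustained load: alert fires"        # average is 85.4
fi
if ! should_fire 80 100 100 20 20 20; then
    echo "short spike: alert does not fire"   # average is 52
fi
```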

✅ Step 3: Test the Flow

          To test whether the alert and automation runbook trigger correctly, simulate high CPU usage on the virtual machine, for example with the stress utility or a custom script. Here’s how you can do it:

          Install the stress utility (if not already installed):

          sudo apt update
          sudo apt install stress -y

          Run the stress test to spike CPU:

          stress --cpu 2 --timeout 600

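          This command starts 2 workers that each keep one CPU core at 100% for 10 minutes (600 seconds). On a VM with 2 vCPUs that is enough to trigger a CPU > 80% alert. Percentage CPU is measured across all cores, so on a larger VM raise --cpu to match the core count; 2 busy cores on a 4-core VM give only about 50%.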
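To size the worker count from the VM's core count instead of hard-coding it, a small sketch follows. The helper name `workers_for` and the 90% target are assumptions chosen for illustration; the script prints the resulting stress command rather than running it.

```shell
#!/bin/sh
# Number of fully busy workers needed to reach at least PCT percent
# average CPU on a machine with CORES cores (rounded up).
# Usage: workers_for CORES PCT
workers_for() {
    cores="$1"
    pct="$2"
    echo $(( (cores * pct + 99) / 100 ))
}

# Detect the core count; fall back to getconf where nproc is unavailable.
CORES="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
WORKERS="$(workers_for "$CORES" 90)"

# Print the command to run on the VM.
echo "stress --cpu $WORKERS --timeout 600"
```

Aiming somewhat above the 80% threshold leaves headroom for the averaging done by the alert rule.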

          Once the load has been running long enough, the following sequence should occur:

          • Alert fires: Azure Monitor detects that the VM’s average CPU usage has exceeded 80% over the rule’s evaluation period, and the alert rule sends a webhook payload to the Automation Runbook.
          • Runbook starts: under Automation > Jobs (or the Runbook’s job history) a job instance appears with status:
            • Running (while the script is executing)
            • Completed (after the script finishes successfully)
          • VM restarts: confirm the restart event in the VM’s Activity Log.
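To exercise the runbook without generating CPU load, you can post a hand-built payload to it. The sketch below assumes you have created your own webhook on the runbook; the `WEBHOOK_URL` variable, the subscription ID, and the resource group name are placeholders. The payload contains only the `data.context` fields that the runbook reads, so it is not a complete alert payload. Note that a successful call restarts the VM.

```shell
#!/bin/sh
# Build a minimal JSON body with the data.context fields the runbook reads.
# Usage: make_payload SUBSCRIPTION_ID RESOURCE_GROUP VM_NAME
make_payload() {
    printf '{"data":{"context":{"subscriptionId":"%s","resourceGroupName":"%s","resourceName":"%s","resourceId":"/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s"}}}\n' \
        "$1" "$2" "$3" "$1" "$2" "$3"
}

# Placeholder values (assumptions): substitute your own.
make_payload "00000000-0000-0000-0000-000000000000" "rg-demo" "k8s-master"

# To send it, post the output to the webhook you created on the runbook, for example:
#   make_payload "$SUB" "$RG" "$VM" | curl -X POST -H "Content-Type: application/json" --data @- "$WEBHOOK_URL"
```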

          Once the job status is “Completed”, you can verify that the VM has been restarted by connecting via SSH and running the following command. A freshly restarted VM reports an uptime of only a few minutes:

          uptime
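If you would rather have a script make that judgment, the following sketch reads the seconds since boot from /proc/uptime (Linux only) and compares them with a limit. The helper name `booted_within` and the 15-minute limit are assumptions for illustration.

```shell
#!/bin/sh
# Succeed when an uptime given in seconds is below a limit given in minutes.
# Usage: booted_within UPTIME_SECONDS LIMIT_MINUTES
booted_within() {
    uptime_seconds="$1"
    limit_minutes="$2"
    [ "$uptime_seconds" -lt $(( limit_minutes * 60 )) ]
}

# The first field of /proc/uptime is the number of seconds since boot.
if [ -r /proc/uptime ]; then
    up="$(cut -d. -f1 /proc/uptime)"
    if booted_within "$up" 15; then
        echo "VM was restarted within the last 15 minutes"
    else
        echo "VM has been up for ${up} seconds"
    fi
fi
```

On most Linux distributions, `uptime -s` prints the boot timestamp directly, which can be compared with the job's completion time in the Automation account.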

          🎬 Conclusion: From Insight to Action — Automating Cloud Operations

          In this hands-on guide, we walked through the essential building blocks of automated monitoring and response in Azure — from identifying problems to resolving them without human intervention.

          • In Part I, we looked at what an Azure Monitor alert rule is made of: the target resource, signal type, criteria, severity, and action.
          • In Part II, we created a real-world alert rule that watches for VM events like shutdowns, ensuring that critical changes don’t go unnoticed.
          • In Part III, we turned insight into action: a high CPU alert automatically triggered a Runbook that restarted the affected VM — a true example of self-healing infrastructure.

          This workflow not only saves time and reduces manual effort, but also increases system resiliency and responsiveness. With just a few steps, you’ve created an intelligent cloud automation pipeline that can detect, alert, and act — all on its own.

          💡 “Let Azure do the heavy lifting—while you get back to lifting your coffee mug.”

          ⚡ From monitoring to resolution — fully automated, fully Azure.