
Automated Monitoring & Remediation Framework Using Azure and DataDog

Description

This project establishes an automated monitoring and remediation framework using Azure infrastructure, DataDog for alerting and visibility, and Azure Functions (Python) to execute remediation tasks such as restarting VMs.

Importance

In hybrid and healthcare cloud environments, fast response to performance degradation is critical. This project showcases event-driven automation to maintain system availability and reduce the burden on SOC teams through auto-remediation workflows integrated with DataDog and Azure.

Objectives

Enable real-time infrastructure monitoring in Azure via DataDog

Automatically restart VMs experiencing high CPU utilization

Integrate DataDog alerting with Azure Functions via secure webhook

Test remediation using synthetic CPU stress events

Track function execution with logs and metrics

Create CPU trend dashboards for Linux and Windows VMs

Validate alert-email-notification loop for monitoring

Showcase seamless Azure-DataDog integration with webhook logic

Tech Stack

Cloud: Azure (VMs, Function App, Log Analytics)

Languages: Python 3.10, Bash

Monitoring: DataDog

IDE: VSCode

Architecture Overview

Workflow Steps:

Azure VM experiences CPU spike.

DataDog Monitor detects and triggers alert.

Webhook Workflow triggers on threshold breach.

HTTP GET hits Azure Function endpoint.

Azure Function restarts the VM.

Logs track all function invocations via Azure Monitor.

Dashboards in DataDog visualize alert trends and CPU comparisons.

Services & Components:

Azure: VM, Function App, App Service Plan, Log Analytics, CLI

DataDog: Monitor, Webhook, Dashboard

Email: Alert notification for threshold breach

Implementation

1. Provisioned an Azure Resource Group and deployed both Linux and Windows virtual machines, enabling diagnostic logging and installing a sample NGINX workload on the Linux VM for service-level testing.

2. Created a centralized Log Analytics Workspace and connected both VMs using the Azure Monitor agent to enable performance and event log collection across the environment.

3. Configured DataDog-Azure integration using a service principal with Reader and Monitoring Reader roles, and verified successful ingestion of metrics including CPU, memory, and disk usage into the DataDog platform.

4. Defined alert monitors in DataDog targeting high CPU utilization using azure.vm.percentage_cpu, and configured the action to trigger a webhook integrated with an Azure Function endpoint.

5. Developed an Azure Function using Python 3.10 with an HTTP trigger, capable of restarting virtual machines using the azure-mgmt-compute SDK and extracting parameters from the webhook payload.

6. Connected DataDog to the remediation function by adding a webhook monitor with dynamic payload fields for resource group, VM name, and alert context.

7. Simulated high CPU activity on the Linux VM using stress, successfully triggering the alert, invoking the function, and confirming the VM reboot via Azure boot time and function logs.

8. Built a custom dashboard in DataDog to display VM CPU trends, triggered alert history, function invocation timing, and stress test overlays for full operational transparency.
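
An added example, not the project's own code, of the checks an HTTP-triggered restart function like the one in steps 5 and 6 should make before it calls the SDK: a shared-secret header compared in constant time, a resource-group allowlist and a VM-name pattern. The header name `x-remediation-secret`, the environment variable names, the allowlist value, the name pattern and the `validate_request`/`handle` helpers are assumptions; the Azure SDK is imported lazily so the checks run without it.

```python
import hmac
import re

# Assumptions (not from the original project): allowlist contents and name pattern.
ALLOWED_RESOURCE_GROUPS = {"rg-monitoring-demo"}  # constructed sample value
VM_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]{0,63}$")


def validate_request(headers, params, expected_secret):
    """Return (status_code, detail); 200 means the restart may proceed."""
    supplied = headers.get("x-remediation-secret", "")  # hypothetical header name
    # Fail closed when no secret is configured; compare in constant time.
    if not expected_secret or not hmac.compare_digest(
        supplied.encode(), expected_secret.encode()
    ):
        return 401, "missing or invalid shared secret"
    rg = params.get("resource_group", "")
    vm = params.get("vm_name", "")
    # Only resource groups explicitly enrolled for auto-remediation are accepted.
    if rg not in ALLOWED_RESOURCE_GROUPS:
        return 403, "resource group not in allowlist"
    if not VM_NAME_RE.fullmatch(vm):
        return 400, "invalid vm_name"
    return 200, f"{rg}/{vm}"


def restart_vm(subscription_id, resource_group, vm_name):
    """Restart one VM; SDK imports are local so validation is testable offline."""
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)
    client.virtual_machines.begin_restart(resource_group, vm_name).result()


def handle(headers, params, env, restart=restart_vm):
    """Core of the HTTP-triggered function: validate first, then remediate."""
    status, detail = validate_request(
        headers, params, env.get("REMEDIATION_SECRET", "")  # hypothetical setting
    )
    if status != 200:
        return status, detail
    restart(env["AZURE_SUBSCRIPTION_ID"], params["resource_group"], params["vm_name"])
    return 200, f"restart requested for {detail}"
```

Wire `handle` into the HTTP trigger by passing the request headers (keys lowercased), the query parameters and `os.environ`.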

Challenges and Resolutions 

Challenge: Webhook not triggering initially

Resolution: Ensured correct GET method, verified Function App URL and CORS settings

Challenge: Azure Function permissions error

Resolution: Used DefaultAzureCredential() and ensured role assignment for VM Contributor
