EndisProjects
Description
This project establishes an automated monitoring and remediation framework using Azure infrastructure, DataDog for alerting and visibility, and Azure Functions (Python) to execute remediation tasks like restarting VMs
Importance
In hybrid and healthcare cloud environments, fast response to performance degradation is critical. This project showcases event-driven automation to maintain system availability and reduce the burden on SOC teams through auto-remediation workflows integrated with DataDog and Azure.
Objectives
Enable real-time infrastructure monitoring in Azure via DataDog
Automatically restart VMs experiencing high CPU utilization
Integrate DataDog alerting with Azure Functions via secure webhook
Test remediation using synthetic CPU stress events
Track function execution with logs and metrics
Create CPU trend dashboards for Linux and Windows VMs
Validate alert-email-notification loop for monitoring
Showcase seamless Azure-DataDog integration with webhook logic
Tech Stack
Cloud: Azure (VMs, Function App, Log Analytics)
Languages: Python 3.10, Bash
Monitoring: DataDog
IDE: VSCode
Architecture Overview
Workflow Steps:
Azure VM experiences CPU spike.
DataDog Monitor detects and triggers alert.
Webhook Workflow triggers on threshold breach.
HTTP GET hits Azure Function endpoint.
Azure Function restarts the VM.
Logs track all function invocations via Azure Monitor.
Dashboards in DataDog visualize alert trends and CPU comparisons.
Services & Components:
Azure: VM, Function App, App Service Plan, Log Analytics, CLI
DataDog: Monitor, Webhook, Dashboard
Email: Alert notification for threshold breach

Implantation
1. Provisioned an Azure Resource Group and deployed both Linux and Windows virtual machines, enabling diagnostic logging and installing a sample NGINX workload on the Linux VM for service-level testing.
2. Created a centralized Log Analytics Workspace and connected both VMs using the Azure Monitor agent to enable performance and event log collection across the environment.
3. Configured DataDog-Azure integration using a service principal with Reader and Monitoring Reader roles, and verified successful ingestion of metrics including CPU, memory, and disk usage into the DataDog platform.
4. Defined alert monitors in DataDog targeting high CPU utilization using azure.vm.percentage_cpu, and configured the action to trigger a webhook integrated with an Azure Function endpoint.
5. Developed an Azure Function using Python 3.10 with an HTTP trigger, capable of restarting virtual machines using the azure-mgmt-compute SDK and extracting parameters from the webhook payload.
6. Connected DataDog to the remediation function by adding a webhook monitor with dynamic payload fields for resource group, VM name, and alert context.
7. Simulated high CPU activity on the Linux VM using stress, successfully triggering the alert, invoking the function, and confirming the VM reboot via Azure boot time and function logs.
8. Built a custom dashboard in DataDog to display VM CPU trends, triggered alert history, function invocation timing, and stress test overlays for full operational transparency.
Challenges and Resolutions
Challenge: Webhook not triggering initially
Resolution: Ensured correct GET method, verified Function App URL and CORS settings
Challenge: Azure Function permissions error
Resolution: Used DefaultAzureCredential() and ensured role assignment for VM Contributor
Pictures



