--- mode: 'agent' description: 'Analyze Azure resource health, diagnose issues from logs and telemetry, and create a remediation plan for identified problems.' --- # Azure Resource Health & Issue Diagnosis This workflow analyzes a specific Azure resource to assess its health status, diagnose potential issues using logs and telemetry data, and develop a comprehensive remediation plan for any problems discovered. ## Prerequisites - Azure MCP server configured and authenticated - Target Azure resource identified (name and optionally resource group/subscription) - Resource must be deployed and running to generate logs/telemetry - Prefer Azure MCP tools (`azmcp-*`) over direct Azure CLI when available ## Workflow Steps ### Step 1: Get Azure Best Practices **Action**: Retrieve diagnostic and troubleshooting best practices **Tools**: Azure MCP best practices tool **Process**: 1. **Load Best Practices**: - Execute Azure best practices tool to get diagnostic guidelines - Focus on health monitoring, log analysis, and issue resolution patterns - Use these practices to inform diagnostic approach and remediation recommendations ### Step 2: Resource Discovery & Identification **Action**: Locate and identify the target Azure resource **Tools**: Azure MCP tools + Azure CLI fallback **Process**: 1. **Resource Lookup**: - If only resource name provided: Search across subscriptions using `azmcp-subscription-list` - Use `az resource list --name ` to find matching resources - If multiple matches found, prompt user to specify subscription/resource group - Gather detailed resource information: - Resource type and current status - Location, tags, and configuration - Associated services and dependencies 2. **Resource Type Detection**: - Identify resource type to determine appropriate diagnostic approach: - **Web Apps/Function Apps**: Application logs, performance metrics, dependency tracking - **Virtual Machines**: System logs, performance counters, boot diagnostics - **Cosmos DB**: Request metrics, throttling, partition statistics - **Storage Accounts**: Access logs, performance metrics, availability - **SQL Database**: Query performance, connection logs, resource utilization - **Application Insights**: Application telemetry, exceptions, dependencies - **Key Vault**: Access logs, certificate status, secret usage - **Service Bus**: Message metrics, dead letter queues, throughput ### Step 3: Health Status Assessment **Action**: Evaluate current resource health and availability **Tools**: Azure MCP monitoring tools + Azure CLI **Process**: 1. **Basic Health Check**: - Check resource provisioning state and operational status - Verify service availability and responsiveness - Review recent deployment or configuration changes - Assess current resource utilization (CPU, memory, storage, etc.) 2. **Service-Specific Health Indicators**: - **Web Apps**: HTTP response codes, response times, uptime - **Databases**: Connection success rate, query performance, deadlocks - **Storage**: Availability percentage, request success rate, latency - **VMs**: Boot diagnostics, guest OS metrics, network connectivity - **Functions**: Execution success rate, duration, error frequency ### Step 4: Log & Telemetry Analysis **Action**: Analyze logs and telemetry to identify issues and patterns **Tools**: Azure MCP monitoring tools for Log Analytics queries **Process**: 1. **Find Monitoring Sources**: - Use `azmcp-monitor-workspace-list` to identify Log Analytics workspaces - Locate Application Insights instances associated with the resource - Identify relevant log tables using `azmcp-monitor-table-list` 2. **Execute Diagnostic Queries**: Use `azmcp-monitor-log-query` with targeted KQL queries based on resource type: **General Error Analysis**: ```kql // Recent errors and exceptions union isfuzzy=true AzureDiagnostics, AppServiceHTTPLogs, AppServiceAppLogs, AzureActivity | where TimeGenerated > ago(24h) | where Level == "Error" or ResultType != "Success" | summarize ErrorCount=count() by Resource, ResultType, bin(TimeGenerated, 1h) | order by TimeGenerated desc ``` **Performance Analysis**: ```kql // Performance degradation patterns Perf | where TimeGenerated > ago(7d) | where ObjectName == "Processor" and CounterName == "% Processor Time" | summarize avg(CounterValue) by Computer, bin(TimeGenerated, 1h) | where avg_CounterValue > 80 ``` **Application-Specific Queries**: ```kql // Application Insights - Failed requests requests | where timestamp > ago(24h) | where success == false | summarize FailureCount=count() by resultCode, bin(timestamp, 1h) | order by timestamp desc // Database - Connection failures AzureDiagnostics | where ResourceProvider == "MICROSOFT.SQL" | where Category == "SQLSecurityAuditEvents" | where action_name_s == "CONNECTION_FAILED" | summarize ConnectionFailures=count() by bin(TimeGenerated, 1h) ``` 3. **Pattern Recognition**: - Identify recurring error patterns or anomalies - Correlate errors with deployment times or configuration changes - Analyze performance trends and degradation patterns - Look for dependency failures or external service issues ### Step 5: Issue Classification & Root Cause Analysis **Action**: Categorize identified issues and determine root causes **Process**: 1. **Issue Classification**: - **Critical**: Service unavailable, data loss, security breaches - **High**: Performance degradation, intermittent failures, high error rates - **Medium**: Warnings, suboptimal configuration, minor performance issues - **Low**: Informational alerts, optimization opportunities 2. **Root Cause Analysis**: - **Configuration Issues**: Incorrect settings, missing dependencies - **Resource Constraints**: CPU/memory/disk limitations, throttling - **Network Issues**: Connectivity problems, DNS resolution, firewall rules - **Application Issues**: Code bugs, memory leaks, inefficient queries - **External Dependencies**: Third-party service failures, API limits - **Security Issues**: Authentication failures, certificate expiration 3. **Impact Assessment**: - Determine business impact and affected users/systems - Evaluate data integrity and security implications - Assess recovery time objectives and priorities ### Step 6: Generate Remediation Plan **Action**: Create a comprehensive plan to address identified issues **Process**: 1. **Immediate Actions** (Critical issues): - Emergency fixes to restore service availability - Temporary workarounds to mitigate impact - Escalation procedures for complex issues 2. **Short-term Fixes** (High/Medium issues): - Configuration adjustments and resource scaling - Application updates and patches - Monitoring and alerting improvements 3. **Long-term Improvements** (All issues): - Architectural changes for better resilience - Preventive measures and monitoring enhancements - Documentation and process improvements 4. **Implementation Steps**: - Prioritized action items with specific Azure CLI commands - Testing and validation procedures - Rollback plans for each change - Monitoring to verify issue resolution ### Step 7: User Confirmation & Report Generation **Action**: Present findings and get approval for remediation actions **Process**: 1. **Display Health Assessment Summary**: ``` 🏥 Azure Resource Health Assessment 📊 Resource Overview: • Resource: [Name] ([Type]) • Status: [Healthy/Warning/Critical] • Location: [Region] • Last Analyzed: [Timestamp] 🚨 Issues Identified: • Critical: X issues requiring immediate attention • High: Y issues affecting performance/reliability • Medium: Z issues for optimization • Low: N informational items 🔍 Top Issues: 1. [Issue Type]: [Description] - Impact: [High/Medium/Low] 2. [Issue Type]: [Description] - Impact: [High/Medium/Low] 3. [Issue Type]: [Description] - Impact: [High/Medium/Low] 🛠️ Remediation Plan: • Immediate Actions: X items • Short-term Fixes: Y items • Long-term Improvements: Z items • Estimated Resolution Time: [Timeline] ❓ Proceed with detailed remediation plan? (y/n) ``` 2. **Generate Detailed Report**: ```markdown # Azure Resource Health Report: [Resource Name] **Generated**: [Timestamp] **Resource**: [Full Resource ID] **Overall Health**: [Status with color indicator] ## 🔍 Executive Summary [Brief overview of health status and key findings] ## 📊 Health Metrics - **Availability**: X% over last 24h - **Performance**: [Average response time/throughput] - **Error Rate**: X% over last 24h - **Resource Utilization**: [CPU/Memory/Storage percentages] ## 🚨 Issues Identified ### Critical Issues - **[Issue 1]**: [Description] - **Root Cause**: [Analysis] - **Impact**: [Business impact] - **Immediate Action**: [Required steps] ### High Priority Issues - **[Issue 2]**: [Description] - **Root Cause**: [Analysis] - **Impact**: [Performance/reliability impact] - **Recommended Fix**: [Solution steps] ## 🛠️ Remediation Plan ### Phase 1: Immediate Actions (0-2 hours) ```bash # Critical fixes to restore service [Azure CLI commands with explanations] ``` ### Phase 2: Short-term Fixes (2-24 hours) ```bash # Performance and reliability improvements [Azure CLI commands with explanations] ``` ### Phase 3: Long-term Improvements (1-4 weeks) ```bash # Architectural and preventive measures [Azure CLI commands and configuration changes] ``` ## 📈 Monitoring Recommendations - **Alerts to Configure**: [List of recommended alerts] - **Dashboards to Create**: [Monitoring dashboard suggestions] - **Regular Health Checks**: [Recommended frequency and scope] ## ✅ Validation Steps - [ ] Verify issue resolution through logs - [ ] Confirm performance improvements - [ ] Test application functionality - [ ] Update monitoring and alerting - [ ] Document lessons learned ## 📝 Prevention Measures - [Recommendations to prevent similar issues] - [Process improvements] - [Monitoring enhancements] ``` ## Error Handling - **Resource Not Found**: Provide guidance on resource name/location specification - **Authentication Issues**: Guide user through Azure authentication setup - **Insufficient Permissions**: List required RBAC roles for resource access - **No Logs Available**: Suggest enabling diagnostic settings and waiting for data - **Query Timeouts**: Break down analysis into smaller time windows - **Service-Specific Issues**: Provide generic health assessment with limitations noted ## Success Criteria - ✅ Resource health status accurately assessed - ✅ All significant issues identified and categorized - ✅ Root cause analysis completed for major problems - ✅ Actionable remediation plan with specific steps provided - ✅ Monitoring and prevention recommendations included - ✅ Clear prioritization of issues by business impact - ✅ Implementation steps include validation and rollback procedures