Key Responsibilities
Observability Platform Implementation:
- Design and maintain distributed tracing, metrics, and logging using OpenTelemetry, Prometheus, Loki, and Tempo.
- Ensure complete instrumentation of .NET Core applications for end-to-end visibility.
- Implement telemetry pipelines for application logs, performance metrics, and traces.
Monitoring & Alerting:
- Develop and manage SLIs, SLOs, and error budgets.
- Create actionable, noise-free alerts using Prometheus Alertmanager and Azure Monitor.
- Monitor key infrastructure components, applications, and databases with a focus on reliability and performance.
Azure & Infrastructure Integration:
- Integrate Azure services (App Services, VMs, Storage, etc.) with the observability stack.
- Configure monitoring for MSSQL databases, including performance tuning metrics and health indicators.
- Use Azure Monitor, Log Analytics, and custom exporters where necessary.
Automation & DevOps:
- Automate observability configurations using Terraform, PowerShell, or other IaC tools.
- Integrate telemetry validation and health checks into CI/CD pipelines.
- Maintain observability as code for repeatable deployments and easy scaling.
Resilience & Reliability Engineering:
- Conduct capacity planning to anticipate scaling needs based on usage patterns and growth.
- Define and implement disaster recovery strategies for critical Azure-hosted services and databases.
- Perform load and stress testing to identify performance bottlenecks and validate infrastructure limits.
- Support release engineering by integrating observability checks and rollback strategies in CI/CD pipelines.
- Apply chaos engineering practices in lower environments to uncover potential reliability risks proactively.
Collaboration & Documentation:
- Partner with engineering teams to promote observability best practices in .NET Core development.
- Create dashboards (Grafana preferred) and runbooks for system insights and incident response.
- Document monitoring standards, troubleshooting guides, and onboarding materials.
Required Skills and Experience
- 4+ years of experience in SRE, DevOps, or infrastructure-focused roles.
- Deep experience with .NET Core application observability using OpenTelemetry.
- Proficiency with Prometheus, Loki, Tempo, and related observability tools.
- Strong background in Azure infrastructure monitoring, including App Services and VMs.
- Hands-on experience monitoring MSSQL databases (deadlocks, query performance, etc.).
- Familiarity with Infrastructure as Code (Terraform, Bicep) and scripting (PowerShell, Bash).
- Experience building and tuning alerts, dashboards, and metrics for production systems.
Preferred Qualifications
- Azure certifications (e.g., AZ-104, AZ-400).
- Experience with Grafana, Azure Monitor, and Log Analytics integration.
- Familiarity with distributed systems and microservice architectures.
- Prior experience in high-availability, regulated, or customer-facing environments.