We are seeking an experienced Lead Dynatrace Subject Matter Expert (SME) to drive the implementation and optimization of observability solutions across dynamic, distributed systems. In this role, you will design advanced monitoring frameworks, improve incident response workflows, and help ensure the reliability and performance of critical systems.
Responsibilities
- Implement and manage Dynatrace observability solutions across distributed systems and environments
- Migrate dashboards, alerts, and telemetry from Grafana to Dynatrace, ensuring data consistency and performance visibility
- Design and configure telemetry ingestion pipelines using the Dynatrace toolset
- Develop and operationalize SLOs/SLIs and automated alerting frameworks aligned with business KPIs
- Deploy and fine-tune AI-driven anomaly detection and AIOps use cases to improve root-cause analysis and incident prevention
- Create the Order-ID Observability Dashboard for end-to-end visibility of order processing
- Collaborate with L2 and L3 support teams to extend observability coverage and enhance incident response workflows
- Integrate observability insights with ServiceNow and other ITSM tools for unified monitoring and ticket correlation
- Drive continuous improvement in MTTD, MTTR, and overall system resilience through proactive analysis and optimization
- Document observability architecture, dashboards, and operational runbooks in Confluence
Requirements
- 7+ years of experience in Site Reliability Engineering, Observability, or Monitoring roles
- Proven hands-on experience with Dynatrace (dashboards, Smartscape, Davis AI, alerting, tagging, SLOs)
- Solid understanding of AIOps platforms, event correlation, and anomaly detection concepts
- Familiarity with ServiceNow or similar ITSM systems for alert/ticket automation
- Experience building and maintaining observability in Azure environments
- Proficiency in scripting or automation (Python, PowerShell, or similar)
- Strong analytical, diagnostic, and problem-solving skills with attention to system reliability and performance
- English proficiency at B2 (Upper-Intermediate) level or higher for effective communication