Key Responsibilities:
Platform Ownership & Cost Optimization
- Maintain and enhance the Grafana and Elastic platforms to ensure high availability and performance.
- Implement cost control mechanisms to optimize resource utilization across Observability platforms.
- Establish platform guardrails, best practices, and governance models.
BAU Support & Vendor/Partner Management
- Manage day-to-day operations, troubleshooting, and platform improvements.
- Engage and manage third-party vendors and partners to ensure SLA adherence and platform reliability.
- Work closely with procurement and finance teams to manage vendor contracts and renewals.
Application Onboarding & Collaboration
- Partner with application owners and engineering teams to onboard applications onto the Observability platform.
- Define standardized onboarding frameworks and processes for application teams.
- Ensure seamless integration with existing observability solutions such as AppDynamics, ServiceNow ITOM, and other monitoring tools.
AI Ops & Advanced Features Implementation
- Deploy AI Ops capabilities within Grafana and Elastic to enhance proactive monitoring and anomaly detection.
- Implement automation and intelligent alerting to reduce MTTR and operational overhead.
- Stay current with industry trends and recommend innovative AI-driven observability enhancements.
Cross-Functional Collaboration
- Work closely with architects of AppDynamics, ServiceNow, and other Observability platforms to ensure an integrated monitoring strategy.
- Align with ITSM, DevOps, and Cloud teams to create a holistic observability roadmap.
- Lead knowledge-sharing sessions and create technical documentation for the team.
People & Team Management
- Lead and manage a team responsible for Grafana and Elastic observability operations.
- Provide mentorship, coaching, and career development opportunities for team members.
- Define team goals, monitor performance, and drive continuous improvement in Observability practices.
- Foster a culture of collaboration, innovation, and accountability within the team.
Qualifications:
Technical Expertise
- 12+ years of experience in IT Operations, Observability, or related fields.
- Strong expertise in Grafana and Elastic Stack (Elasticsearch, Logstash, Kibana).
- Experience in implementing AI Ops, machine learning, or automation within observability platforms.
- Proficiency in scripting and automation (Python, Ansible, Terraform) for Observability workloads.
- Hands-on experience with cloud-based Observability solutions, particularly in Azure environments.
- Familiarity with additional monitoring tools such as AppDynamics, ServiceNow ITOM, SevOne, and ThousandEyes.
Leadership & Collaboration
- Experience in managing vendors, contracts, and external partnerships.
- Strong stakeholder management skills and ability to work cross-functionally.
- Excellent communication and presentation skills.
- Ability to lead and mentor junior engineers in Observability best practices.