Work closely with development, operations, and product teams to ensure monitoring solutions align with business goals.
Create and maintain scripts and automation tools to streamline monitoring and alerting processes.
Produce and maintain clear documentation on monitoring setups, best practices, and troubleshooting procedures.
Train team members and stakeholders on effective use and management of Datadog tools and features.
Monitor the performance and availability of software systems, identify and resolve issues, and implement proactive measures to prevent future incidents.
Design and maintain fault-tolerant architectures using redundancy, load balancing, and automated failover mechanisms to minimize downtime and ensure seamless service availability.
Develop and implement automation strategies to reduce manual intervention and improve system reliability.
Optimize system performance through proactive monitoring and tuning.
Prepare and execute disaster recovery plans to ensure business continuity.
Work closely with development and operations teams to bridge the gap between them, ensuring smooth deployment and operation of applications.
Incident Management
Follow the incident management process, ensuring timely resolution and minimizing service disruptions.
Conduct root cause analysis and implement preventive measures to reduce recurring incidents.
Develop and maintain incident response procedures and communication protocols.
Change Management
Manage the change management process, ensuring controlled and efficient implementation of changes.
Assess the impact of proposed changes and mitigate potential risks.
Ensure compliance with change management policies and procedures.
Metrics And Reporting
Generate regular reports and dashboards to provide insights into service performance.
Use data-driven insights to identify trends and drive continuous improvement.
Transformation And Automation
Identify opportunities for process automation and implement solutions to improve efficiency.
Evaluate and implement new monitoring tools.
Key Requirements
Proven expertise in multiple monitoring tools.
Minimum of 8 years of experience in monitoring and DevOps.
Proficiency in scripting, coding, and software development principles.
Strong understanding of IT operations and system management.
Strong experience with automation tools and frameworks.
Excellent troubleshooting and problem-solving skills.