Design, implement, and support monitoring tools in a complex, heterogeneous, multi-platform environment.
Contribute to continual improvement of the system to incorporate advanced monitoring and self-healing capabilities.
Integrate data from various sources into the monitoring tools, ensuring data accuracy and completeness. Analyze data to identify trends, anomalies, and potential issues.
Create and manage dashboards, custom views, saved searches, and alerts in the monitoring tools to track system performance and availability.
Identify performance bottlenecks and optimize platform performance using insights from the monitoring tools.
Lead incident response efforts to quickly resolve issues and minimize downtime, using the monitoring tools for root cause analysis.
What you bring:
7 to 10 years of experience setting up and supporting the monitoring tools Nagios, Splunk, and Zabbix.
Proficiency in programming and scripting languages (e.g., Python, Perl, PowerShell, and Bash).
7 to 10 years of experience configuring monitoring and alerts for public cloud, virtualization platforms such as VMware, operating systems (Windows Server and Linux), and databases (MS SQL and Oracle) using the observability tools Nagios and Splunk.
Strong analytical skills, excellent problem-solving abilities, and effective communication skills.