Key Responsibilities
Service Reliability & Automation
- Establish, monitor, and enforce Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for infrastructure tooling, including configuration compliance, patch success rates, and deployment latency
- Provide Level 3 expertise for tooling-specific incidents, with a focus on automating remediation and reducing MTTR
- Automate repetitive tasks across managed infrastructure to measurably reduce operational overhead (e.g., shorter server build times)
- Conduct root cause analysis and lead blameless postmortems for service-impacting incidents to drive systemic improvements
Infrastructure & Configuration Management
- Engineer and maintain automation scripts for asset management, configuration databases, and monitoring systems
- Design, develop, and deploy full-stack applications, custom plugins, and automation scripts for direct device interaction
- Maintain Infrastructure-as-Code (IaC) configurations for Windows and Linux servers using tools such as Ansible, Terraform, or Puppet
- Implement drift detection and auto-remediation capabilities for configuration compliance
Network & Security Device Automation
- Build API-driven tools for network configuration, firmware updates, pre/post-change validation, and real-time health monitoring
- Deploy monitoring agents, centralized logging, and dashboards, with alerts driven by critical SLIs (latency, error rates, traffic, saturation)
- Develop automation scripts for intelligent ticket handling, validation, and escalation workflows within enterprise ticketing systems
Monitoring & Continuous Improvement
- Implement and manage monitoring solutions (e.g., Prometheus, Grafana, Datadog) and centralized logging platforms (e.g., ELK Stack)
- Build custom dashboards, alerts, and reporting for infrastructure and security devices
- Participate in continuous improvement initiatives to enhance automation, tooling reliability, and system resilience