Job Responsibilities
- Automate the deployment of logging, metrics, and monitoring services using Puppet for configuration management.
- Address and resolve production incidents by applying Linux administration and engineering expertise.
- Lead projects from inception to completion, including designing technical solutions, managing timelines, and executing deliverables.
- Design and implement metrics dashboards and alert criteria to effectively monitor and scale services.
- Participate in a week-long on-call rotation in collaboration with team members.
- Assist development teams in enhancing their logging and metrics collection processes.
- Be able to handle an on-call shift every few weeks.
Typical Qualifications
- Possess 5 to 8 years of experience in a production environment and exhibit strong system administration and DevOps skills for managing services in a Linux environment.
- Demonstrate hands-on experience with configuration management tools such as Puppet or Ansible.
- Strong experience troubleshooting production services in a Linux environment and participating in on-call rotations.
- Proficient in programming, with experience writing and maintaining scripts and programs in the following languages: Bash, Ruby, Python, Perl, C++, Java, and Golang.
- Experience developing Infrastructure as Code utilizing Terraform and CloudFormation.
- Display adaptability and flexibility in response to changing environmental and business demands.
Additional Qualifications
- Demonstrated experience managing production fleets of thousands of servers.
- Subject matter expertise in relevant technologies, including Fluentd, Kafka, Elasticsearch, Graphite, ClickHouse, Prometheus, Grafana, Graylog, Terraform, CloudFormation, Docker, Jenkins, and Git.
- Exposure to Amazon Web Services (AWS) for deploying, managing, and scaling applications, with a foundational understanding of AWS services, architecture, and best practices.
- Proficient in using protocol analyzers such as tcpdump and Wireshark.