Senior Site Reliability Engineer (HPC/Cloud)

  • Posted 19 hours ago

Job Description

Key Responsibilities

  • Respond to and resolve operational incidents, identify root causes for critical issues, and implement strategies to prevent recurrence and improve platform resiliency.
  • Proactively create and manage monitoring, logging, and alerting systems to ensure high availability, performance, and visibility across all services.
  • Take a Site Reliability Engineering approach to our services, improving deployment, monitoring, and incident response end to end.
  • Solve complex technical problems involving SCP applications, infrastructure, and end users' use of the services.
  • Administer platform tools like Ansible, Vault, Consul, Prometheus, and Grafana to support core functions like configuration management, secrets management, monitoring, and observability.
  • Mentor and coach junior engineers in the team, fostering a collaborative and high-performing culture.
  • Drive automation for deployment and management processes using GitOps workflows as well as CI/CD pipelines.

Essential Knowledge, Skills, And Experience

  • Experience administering, maintaining, and troubleshooting Linux environments
  • Competent in automation and Bash scripting
  • Highly customer focused; able to explain technical IT concepts in terms that non-IT experts can understand
  • Hands-on experience working in a DevOps team and using agile methodologies

Plus Some Of The Following Areas Of Expertise

  • Hands-on knowledge of a range of scientific and HPC applications such as simulation software, bioinformatics tools or 3D data visualization packages
  • Experience administering and optimizing SLURM
  • Experience deploying and administering OpenStack
  • Experience with configuration automation and infrastructure as code (e.g. Ansible, HashiCorp Terraform, AWS CloudFormation, AWS Cloud Development Kit)
  • Experience deploying infrastructure and code to public cloud, especially AWS
  • Experience with software distribution frameworks such as EasyBuild or Spack
  • Familiarity with container runtimes such as Docker, Singularity or enroot
  • Experience with regression-testing and benchmarking frameworks for HPC applications, such as ReFrame

Job ID: 147197727