SymphonyAI

Site Reliability Engineer - Architect/Principal

Job Description

Introduction

Job Title:

Site Reliability Engineer - Architect / Principal

Department:

Engineering - IRIS Smart Manufacturing Platform

Overview:

SymphonyAI is at the forefront of innovation, leveraging cutting-edge artificial intelligence and machine learning technologies to transform industries and drive business growth. As a global leader in AI-powered solutions, SymphonyAI empowers organizations with enterprise applications that rapidly deliver transformative business value across retail, CPG, financial services, manufacturing, media, enterprise IT, and the public sector. We are on a mission to build a world-class engineering team with a high-performance culture.

Our solutions, hosted on the IRIS Smart Manufacturing platform, combine equipment and process domain expertise in Mining & Metals, Oil & Gas, and Chemicals & Petrochemicals with the state of the art in data science, machine learning, and process optimization. The IRIS platform supports hybrid deployments and is built on a microservices architecture.

We are seeking a highly skilled SRE Architect / Principal to design, implement, and maintain highly available, scalable, and secure systems across cloud and on-premises environments. The ideal candidate will combine deep technical expertise with a strategic mindset to drive reliability, automation, and performance across mission-critical applications. This is a hands-on role with architect-level responsibilities, including mentoring teams, shaping platform reliability practices, and influencing operational strategy.

Responsibilities:

  • Contribute to the IRIS Platform SRE and operations roadmap and execute planned research and development.
  • Lead the design, deployment, and operations of large-scale systems on AWS (EKS) or Azure (AKS), ensuring reliability, scalability, and security.
  • Serve as the principal architect for platform reliability, performance, and disaster recovery strategies.
  • Troubleshoot active production issues, provide root cause analysis (RCA), and coordinate with DevOps and backend teams on permanent fixes. Serve as the primary L2 point of contact for performance issues and be available for troubleshooting outside office hours.
  • Install observability stacks in Kubernetes, such as EFK, Prometheus, and Grafana.
  • Create SLO dashboards and Grafana widgets using Prometheus queries, with working knowledge of all Prometheus metric types and the most common Prometheus functions.
  • Architect, install, and govern service mesh (Istio/Linkerd) adoption for secure service-to-service communication, traffic management, and zero-trust networking.
  • Apply SRE best practices, including SLIs, SLOs, error budgets, incident management, and post-mortem analysis.
  • Implement alerting strategies that reduce noise and improve detection of actionable incidents.
  • Optimize platform performance, scalability, and cloud cost efficiency (FinOps practices) across Kubernetes, cloud, and hybrid environments.
  • Implement Infrastructure as Code using Terraform, CloudFormation, or AWS CDK for cloud and Kubernetes environments.
  • Ensure compliance with security, governance, and operational standards across all deployments.
  • Mentor junior engineers and act as a technical authority on reliability, cloud architecture, and DevOps.

Required Skills & Qualifications:

  • 7+ years of experience in Site Reliability Engineering (SRE).
  • 3+ years of hands-on experience working with Linux systems.
  • 4+ years of commercial experience with Kubernetes.
  • 2+ years of experience working with Docker.
  • 4+ years of experience setting up and managing CI/CD pipelines.
  • 4+ years of experience working with automation tools such as Terraform and Ansible.
  • Experience with container tooling, including Helm, and with CI/CD pipelines.
  • Good knowledge of security best practices and vulnerability management tools (e.g., Acunetix, Snyk, Checkmarx, Trivy).
  • Experience troubleshooting production issues and performing root cause analysis.
  • Ability to work effectively in an Agile environment.

Preferred Skills & Qualifications:

  • Operational knowledge of databases (PostgreSQL, Elasticsearch, Redis, or similar).
  • Exposure to configuring web servers such as Nginx.
  • Working knowledge of monitoring tools such as Grafana and Prometheus.
  • Working knowledge of a messaging framework such as Azure Event Hubs, Kafka, RabbitMQ, or similar.

About Us

SymphonyAI is building the leading enterprise AI SaaS company for digital transformation across the most critical and resilient growth industries, including retail, consumer packaged goods, financial crime prevention, manufacturing, media, and IT service management. Founded in 2017, SymphonyAI today serves 1,500+ enterprise customers globally and has grown to 3,000 talented leaders, data scientists, and other professionals across more than 30 countries.

Diversity & Inclusion Statement:

We are committed to building a diverse and inclusive team and encourage candidates from all backgrounds to apply.

Job ID: 147202875