Job Description
Project Role : Technology Support Engineer
Project Role Description : Resolve incidents and problems across multiple business system components and ensure operational stability. Create and implement Requests for Change (RFC) and update knowledge base articles to support effective troubleshooting. Collaborate with vendors and help service management teams with issue analysis and resolution.
Must have skills : Site Reliability Engineering
Good to have skills : NA
Minimum 5 Year(s) Of Experience Is Required
Educational Qualification : 15 years full time education
Role Overview
Site Reliability Engineer – Team Lead with hands-on experience in application and infrastructure automation in a multi-cloud environment. The candidate should have enterprise exposure across Data Center and public cloud platforms, specifically AWS and Azure, along with experience in DevOps, microservices, and coding. The role requires an enterprise architect profile with strong technical expertise focused on automating operational toil and reducing technical debt. The candidate should have extensive exposure to cloud infrastructure using Infrastructure as Code (IaC) and hands-on experience with Java, Python, PowerShell, Ansible, GitHub, Jenkins, Terraform, JSON, Puppet, and related tools. The role requires hands-on engagement and experience in end-client discussions covering technical and business requirements. The candidate should have more than 9 years of infrastructure experience involving design, build, and deployment across Data Center services and Cloud environments. Familiarity with coding, design, build, and deployment using CI/CD pipelines is required. The candidate should have delivered at least two end-to-end projects covering SRE design, implementation, and support. The role also involves identifying opportunities related to technical debt, waste reduction, and coding techniques, particularly around Infrastructure as Code. Knowledge of Splunk development is an added advantage. Working knowledge of observability tools and other enterprise monitoring tools is a plus.
Key Responsibilities
Leadership skills to run the SRE function as a business and achieve targeted goals.
Work with business and other stakeholders to translate technical and functional requirements.
Should be well versed in Command Center requirements.
Exposure to Command Center operations for managing critical events.
Experience as an SRE with a strong coding background in automating toil.
Troubleshooting, health checks, administration, management, vendor coordination, interaction with external partners, and escalation to stakeholders for support or to application teams for application development issues (bugs, code maintenance, code evolution)
Capacity monitoring, application availability management and monitoring, reporting, and maintenance activities (if documented)
Work on reduction of repeated failures; generate reports and dashboards
Performance review: performance management, tuning, issue fixing, reduction of repeated failures, scripting, and automation
Generate reports and dashboards; deploy agents
Monitor Docker environments and maintain Docker images
Supporting Compliance requirements
Technical Skills
Exposure to automation, especially IaC, building pipelines in public cloud, deployments, and ARM and other templates.
Strong coding knowledge: Python, PowerShell, Ansible, Jenkins, Terraform, Git, JSON, Puppet, etc.
Strong SRE knowledge and exposure, especially in identifying toil and technical debt and reducing waste.
Designing and creating an observability framework
Must have exposure to Splunk and other enterprise management tools
IaaS/PaaS products - Support for Containers and Cloud Native Stack
Lateral and logical troubleshooting as a cloud admin.
Complete understanding of cloud network topology
Docker design, build, and deployment: at least 2 years of technical exposure
CI/CD exposure with full end-to-end DevOps life cycle experience.
Hands-on exposure to public cloud, especially infrastructure.
Should be well versed in monitoring, observability, and other enterprise management tools.
Extensive coding exposure using Python, PowerShell, Ansible, Jenkins, Terraform, etc.
At least 5 years of experience as a hands-on SRE.
Soft Skills
Strong leadership capability to guide and motivate teams
Excellent communication and relationship-building skills
Ability to lead teams to achieve project and operational goals
Strong understanding of project management principles and methodologies
Preferred Qualifications
The candidate should have a minimum of 12 years of experience in Site Reliability Engineering.
15 years of full-time education is required.
Must-have skills: proficiency in Site Reliability Engineering.
Ability to lead and motivate teams to achieve project goals.
Certifications: One in Data Center technologies and Cloud.
Key Deliverables
Automation of operational toil and reduction of technical debt
Infrastructure automation using Infrastructure as Code and CI/CD pipelines
Observability framework design and implementation
Monitoring, capacity monitoring, and application availability management
Reports and dashboards
Deployment of monitoring agents
Docker environment monitoring and Docker image maintenance
Reduction of repeated failures through scripts, automation, tuning, and fixes
Compliance support