This role requires rotational on-call support on weekends and Mumbai holidays.
Responsibilities
System and Service Reliability: Ensure the reliability, availability, and performance of production systems. Continuously monitor the health of critical services, identify potential risks, and take proactive measures to prevent outages.
Incident Management and Resolution: Lead incident response efforts for production systems, minimizing downtime and ensuring quick recovery. Perform post-mortem analysis after incidents to identify root causes and implement corrective actions.
Monitoring and Observability: Implement and maintain monitoring, logging, and alerting systems to detect and resolve system issues quickly. Use metrics, logs, and traces to gain insights into system behavior and user experience. Review/support tools for environment monitoring and analysis including system health and performance. Automate repetitive tasks and processes to improve efficiency, reduce human error, and scale systems.
Performance Tuning and Optimization: Identify and resolve performance bottlenecks across systems, networks, and applications. Optimize resource utilization in cloud and on-premises environments to improve system efficiency and reduce operational costs. Conduct capacity planning and scaling exercises to ensure systems meet growing demand.
Collaboration with Development Teams: Partner with software engineering teams to design and implement scalable, fault-tolerant services. Assist in code reviews and provide feedback on operational concerns, such as scalability, security, and reliability. Participate in design and architecture discussions to ensure systems are built for reliability and operational excellence.
Security and Compliance: Ensure security best practices are followed in the design, implementation, and operation of production systems. Regularly review system configurations, logs, and access controls for vulnerabilities and compliance with regulatory requirements. Collaborate with the security team to develop and execute incident response and disaster recovery plans. Coordinate maintenance activities and outages with the infrastructure, application management, and Service Center teams.
Documentation and Knowledge Sharing: Maintain clear and comprehensive documentation for system architecture, incident management processes, and troubleshooting procedures. Share knowledge and provide mentorship to junior SREs and other team members. Contribute to knowledge bases and internal tools to improve operational efficiency.
Continuous Improvement: Stay up to date with the latest industry trends, tools, and best practices in site reliability engineering. Continuously review and improve operational processes and workflows to increase system uptime and performance. Foster and support a culture of continuous learning and improvement within the team.
Qualifications
3+ years of overall experience working in an IT infrastructure operations environment.
3+ years of relevant experience in infrastructure/application monitoring and technical support.
Familiarity with change management, incident management, release management, and problem management.
Good working knowledge of Linux and Windows administration (Linux is mandatory).
Experience with MS-SQL, Oracle, and MongoDB preferred. SSIS DB proficiency preferred.
Good problem-solving/troubleshooting skills. Quick learner.
A good understanding of support environments and Service Level Agreements
Experience with monitoring tools such as Splunk, Nagios, LogicMonitor, AppDynamics, etc. preferred
Experience handling major incidents and escalations
Strong written and oral communication skills and interpersonal skills
Experience with SCCM and Windows patch management desired
Understanding of container and Kubernetes platforms preferred
Development experience preferred (Java, SQL, Ansible, shell and batch scripting)
Strong team player with an excellent eye for detail
High degree of self-motivation and strong analytical and creative problem-solving skills
Good understanding of the latest data center technologies