Provide SRE support for container platforms, applying SRE knowledge to identify potential gaps in the observability design or implementation.
Work with clients and application and development teams to onboard applications and integrate them with the CI/CD platform.
Provide technical expertise to configure, deploy, and support Bank workloads so that they run and operate securely on container infrastructure (K8s/Red Hat OpenShift/AKS).
Engineer new capabilities for the OpenShift/container platforms and deliver those capabilities in a fully automated and supportable fashion.
Implement cluster services to manage on-prem bare-metal OpenShift cluster deployments and off-prem deployments.
Work with monitoring tools and Application Development teams to enhance monitoring capabilities and modify monitoring dashboards for new observability plans created in support of initiatives or continuous improvement efforts.
Develop software or system scripts to simplify or eliminate the dependence on human intervention for recurring tasks.
Work with Production Support teams to perform knowledge transfer, playbook updates and training for new monitoring capabilities.
Identify vulnerabilities and opportunities for reliability improvement, for example by investigating low-level error rates and noise in monitoring, and help define solutions to improve system reliability.
Develop and maintain a catalog of extensible reliability scripts, tools, and libraries that can be leveraged for common instrumentation, automation and operational needs.
Requirements
Education: B.E. / B. Tech / M.E. / M. Tech / MCA
Certifications If Any: N/A
Experience Range: 8 to 10 years
Foundational Skills
Experience as a Site Reliability Engineer within large, multinational organizations, preferably implementing new technologies, with a proven track record of success.
Demonstrated ability to design and develop significant components within an application.
Expertise in supporting container production environments (K8s/Red Hat OpenShift) and the associated maintenance, change control, and incident and problem management.
Strong experience in Linux administration, programming experience in at least one language (Python, shell scripting, Java, etc.), and experience with cloud-native technologies.
Strong experience in onboarding applications to container and multi-cloud platforms - Azure, AWS, GCP, IBM Cloud
Strong experience in infrastructure automation using any of Terraform/Packer, Ansible, or Python.
Understanding of, or exposure to, Agile as well as ITSM incident/change/request management processes.
Experience implementing platform resiliency, self-healing, health compliance dashboards, and automation of day-to-day operational tasks across hybrid cloud for enterprise-class, production-grade environments is desired.
Experience in PaaS logging, monitoring, and observability tools such as ELK, Fluentd, Prometheus, Splunk, Nagios, Datadog, etc.
Experience in building large-scale distributed enterprise platforms with a focus on performance, scale, security, and reliability.
Self-motivated and results-oriented, with excellent analytical, problem-solving, interpersonal, presentation, and communication skills.
Ability to operate in a fast-paced environment with multiple concurrent priorities.
Desired Skills
Experience in designing, analyzing, and troubleshooting large-scale distributed systems, and a good understanding of multi-vendor cloud offerings.
Experience in cloud-native network, storage, and virtualization technologies.
Experience in DevOps and GitOps models with IaaS, Config-as-Code, Policy-as-Code, and CI/CD tools: Bitbucket, JFrog, Jenkins, Artifactory, Ansible.
Experience with modern performance monitoring and diagnostics tools (examples: Splunk, Splunk ITSI, AppD, Dynatrace, SolarWinds, etc.)
Understand relevant application technologies and development life cycles.
Operational Process Routines: Strong adherence to operating controls, risk management, process review and creation, documentation and collaborative knowledge sharing.
Ability to use qualitative and quantitative analytical skills to assess the effectiveness of operations, manage competing priorities, and adapt to changes in project scope.
Proven ability to work independently with minimal supervision and as part of a team with direct responsibilities.