As a Senior Site Reliability Engineer, you will be responsible for:
- Demonstrating best practices in Cloud DevOps development, along with a willingness to continually learn cloud-native technologies.
- Following security guidelines to develop secure and compliant Cloud services by working with Risk and Security teams.
- Monitoring configuration management, platform layout, and hosting infrastructure.
- Automating deployment of applications and infrastructure
- Working independently and in a team environment, managing a range of customers and technical situations.
- Providing technical application support for enterprise-level systems
- Running our infrastructure with Chef, Ansible, Terraform, GitHub CI/CD, and Kubernetes
- Participating in capacity planning, system performance monitoring, resource utilization trending, and incident and change management.
- Coordinating with cloud infrastructure partners on server, network, database, and service-related incidents and projects
- Deploying application upgrades/patches in production and test environments
- Troubleshooting application alerts and Azure and AWS policy issues using monitoring tools and code inspection, and performing root cause analyses (RCAs)
- Creating tutorials, how-to videos, knowledge base articles, and other technical articles for the customer community, and keeping them up to date
- Working on critical, complex customer problems that may span multiple services
- Participating in a 24x7 on-call rotation and working with global teams
- Collaborating with cross-functional stakeholders
- Providing mentorship and guidance to team members
- Ensuring security best practices are integrated into the development lifecycle, including compliance with data protection regulations.
- Collaborating with stakeholders to understand requirements, set priorities, and communicate progress and challenges.
Fuel your passion
To be successful in this role you will:
- Have a bachelor's degree in Computer Science or another STEM major (Science, Technology, Engineering, and Math) with 7-10 years of total experience.
- Have 5-8 years of experience with cloud infrastructure platforms such as AWS and Azure. Have prior experience in setting up, running and configuring Cloud applications.
- Have 5+ years of hands-on experience with public cloud-based applications, technologies, and tools for deployment, monitoring, and operations, such as Docker and Kubernetes.
- Have 5+ years of experience with Linux (RHEL) performance monitoring, including the relevant parameters, their interpretation, and the commands used for monitoring
- Have mastery of collaborative software development using Git, Jira, Confluence, etc.
- Have experience in infrastructure optimization in Cloud.
- Have a deep understanding of operating and monitoring Java applications and Dockerized containers
- Have hands-on experience in CI/CD (AWS CodePipeline, Azure DevOps, GitLab CI/CD, Jenkins) and IaC tools (Terraform, AWS CloudFormation, Ansible, etc.)
- Be an expert in performance monitoring and capacity management of enterprise systems using various tools.
- Have experience in observability, including APM tools (Dynatrace, AppDynamics, etc.), metrics and log consolidation (Splunk), and monitoring and logging tools such as Prometheus, Grafana, and the ELK stack; this experience is essential.
- Have knowledge of application design patterns, J2EE application architectures, microservices, Spring Boot, and cloud-native architectures
- Have proficiency in Java runtimes, Core Java, garbage collection, and JVM parameter tuning
- Have experience in RDBMS and NoSQL database technologies
- Have knowledge of automation scripting languages such as Python, Linux shell scripting, and Windows PowerShell
- Have experience with change management and incident management processes