Introduction
A career in IBM Consulting is built on long-term client relationships and close collaboration worldwide. You'll work with leading companies across industries, helping them shape their hybrid cloud and AI journeys. With support from our strategic partners, robust IBM technology, and Red Hat, you'll have the tools to drive meaningful change and accelerate client impact. At IBM Consulting, curiosity fuels success. You'll be encouraged to challenge the norm, explore new ideas, and create innovative solutions that deliver real results. Our culture of growth and empathy focuses on your long-term career development while valuing your unique skills and experiences.
Your Role And Responsibilities
- Deep understanding of multi-cloud platform technologies and capabilities to support site reliability goals.
- Identify points of failure and performance bottlenecks, and provide feedback to the architecture teams.
- Develop effective tooling, alerts, and response procedures to identify and address reliability risks, including automatic problem detection and mitigation.
- Identify the tools best suited for integration into the CI/CD pipeline to measure performance, code quality, and code coverage. Define the quality gates in the CI/CD pipeline by working with the application architects.
- Define and implement the strategy for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
- Identify and implement methods for scaling applications, as well as tools for logging, monitoring, alerting, and runbook automation for auto-remediation (self-healing).
- Work with the application and support teams during critical situations to identify the root cause of failures and help fix them.
- Apply software engineering practices to IT operations problems, with the goal of creating software systems that are highly scalable and reliable.
- Engage with enterprise and business/infrastructure functions to establish, track, and optimize operational metrics and targets in line with SRE principles (SLOs/SLIs, latency percentiles, error budgets, tech debt, and alerting guidelines).
- Design, analyze, and troubleshoot large-scale distributed systems, with the ability to fix both application and infrastructure issues.
- Programming, tooling, and automation experience in one or more of the following languages and tools: Java, Python, Ansible, Terraform, and shell scripting.
Preferred Education
Master's Degree
Required Technical And Professional Expertise
- Strong multi-cloud SRE experience delivering production support
- Experience with IaaS and PaaS providers such as AWS, Azure, GCP, and OpenShift, as well as the open-source ecosystem
- Experience with containerization and container platforms (e.g., Docker, Kubernetes)
- Excellent automation programming skills using Python, Terraform, Ansible, and Java
- Experience with enterprise monitoring solutions such as Dynatrace, Instana, Prometheus, Grafana, Nagios, and Splunk
- Good understanding of Jenkins and CI/CD architecture, and of middleware products such as Kafka, MQ, and SQL/NoSQL databases
Preferred Technical And Professional Experience
- Cloud Native Applications: Experience with cloud native applications and their integration with multiple cloud platforms, including design and implementation of scalable and reliable software systems.
- Advanced CI/CD Pipelines: Experience in integrating advanced tools into CI/CD pipelines to ensure seamless application scaling and auto-remediation, including expertise in identifying failure points and performance bottlenecks.
- Multi-Cloud Strategy: Experience in developing strategies for availability, latency, performance, and capacity planning across multiple cloud platforms, including defining quality gates and implementing scalable software systems.