Technical Architect-Multi Cloud Site Reliability

IBM

Bengaluru, India

Fresher

This job is no longer accepting applications

Posted 14 days ago

Job Description

Introduction

A career in IBM Consulting is built on long-term client relationships and close collaboration worldwide. You'll work with leading companies across industries, helping them shape their hybrid cloud and AI journeys. With support from our strategic partners, robust IBM technology, and Red Hat, you'll have the tools to drive meaningful change and accelerate client impact. At IBM Consulting, curiosity fuels success. You'll be encouraged to challenge the norm, explore new ideas, and create innovative solutions that deliver real results. Our culture of growth and empathy focuses on your long-term career development while valuing your unique skills and experiences.

Your Role And Responsibilities

Apply a deep understanding of multi-cloud platform technologies and capabilities to support site reliability goals.
Identify points of failure and performance bottlenecks, and provide feedback to the architecture teams.
Develop effective tooling, alerts, and response procedures to identify and address reliability risks, including automatic problem detection and mitigation.
Identify the tools best suited for integration into the CI/CD pipeline to measure performance, code quality, and code coverage. Define the quality gates in the CI/CD pipeline by working with the application architects.
Define and implement the strategy for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Identify and implement methods for scaling applications, as well as tools for logging, monitoring, alerting, and runbook automation for auto-remediation (self-healing).
Work with the application and support teams during critical situations to identify the root cause of failures and help fix them.
Apply aspects of software engineering to IT operations problems, with the goal of creating software systems that are highly scalable and reliable.
Engage with enterprise and business/infrastructure functions to establish, track, and optimize operational metrics and targets in line with SRE principles (SLOs/SLIs, latency percentiles, error budgets, tech debt, and alert guidelines).
Design, analyze, and troubleshoot large-scale distributed systems, with the ability to fix both application and infrastructure issues.
Programming, tooling, and automation experience in one or more of the following languages and tools: Java, Python, Ansible, Terraform, and shell scripting.

Preferred Education

Master's Degree

Required Technical And Professional Expertise

Strong multi-cloud SRE experience delivering production support
Experience with IaaS and PaaS providers such as AWS, Azure, GCP, and OpenShift, and with the open-source ecosystem
Experience with containerization and container platforms (e.g., Docker, Kubernetes)
Excellent automation programming skills using Python, Terraform, Ansible, Java
Experience with enterprise monitoring solutions such as Dynatrace, Instana, Prometheus, Grafana, Nagios, and Splunk
Good understanding of Jenkins and CI/CD architecture, and of middleware products such as Kafka, MQ, and SQL/NoSQL databases

Preferred Technical And Professional Experience

Cloud Native Applications: Experience with cloud native applications and their integration with multiple cloud platforms, including design and implementation of scalable and reliable software systems.
Advanced CI/CD Pipelines: Experience in integrating advanced tools into CI/CD pipelines to ensure seamless application scaling and auto-remediation, including expertise in identifying failure points and performance bottlenecks.
Multi-Cloud Strategy: Experience in developing strategies for availability, latency, performance, and capacity planning across multiple cloud platforms, including defining quality gates and implementing scalable software systems.