
The Technology Operations Lead owns and drives production stability, observability and operational governance of digital platforms and applications, ensuring they function seamlessly in production environments.
Title: Technology Operations Lead
Position Objectives
The role acts as the single point of ownership for production operations. It is responsible for incident and problem management, change and release governance, and observability effectiveness, while ensuring alignment with business SLAs, regulatory requirements and enterprise standards.
Indicative Responsibilities
1. Application Production Support
• Own availability, reliability and performance of production business applications and digital platforms
• Own operational acceptance of applications before they go live; ensure readiness across the support model (L1/L2/L3), documentation and runbooks, capacity and performance baselines, and DR and backup readiness
• Obtain sign-off from the application owner that applications are fit for production support
• Ensure adherence to SLA, uptime and performance benchmarks
• Maintain end-to-end visibility across application and infrastructure layers
• Govern capacity planning, especially for peak loads and business events
• Participate in DR drills and failover testing
• Support vulnerability remediation prioritization
2. Incident Management (Command & Control)
• Act as Incident Commander for P1 incidents - Drive war rooms, triage and cross-team coordination (App & Infra)
• Ensure rapid restoration of services and structured internal stakeholder communication across teams
• Track and reduce incident frequency and impact
• Ensure incidents are logged, tracked, categorized, and closed as per ITSM processes
3. Problem Management & RCA Governance
• Validate the quality, depth, and accuracy of RCAs provided by internal teams and vendors/partners.
• Ensure permanent fixes and prevention of recurring issues
• Maintain and track problem backlog and corrective actions
4. Change & Release Governance
• Participate in change and release governance from a production stability perspective
• Review production readiness for releases, including rollback and recovery plans, monitoring and alerting readiness, support runbooks and escalation models
• Approve/reject changes based on change process completeness
• Ensure controlled and stable release cycle
5. Observability & Monitoring Governance
• Govern Application Performance Monitoring (APM) and metrics; maintain visibility across application and infrastructure dependencies
• Contribute to enhancing infrastructure monitoring frameworks.
• Improve alert quality, reduce noise and ensure actionable monitoring
• Enable proactive detection of issues
6. Vendor Management & Governance
• Manage vendor partners for production operations
• Ensure adherence to SLA, response timelines and quality standards
• Prevent blame shifting and enforce clear ownership & accountability
• Drive performance reviews and escalation management
• Seek monthly and quarterly operations health reports
• Own and validate production operations dashboards shared by partner/vendor covering availability, incidents, business journeys, change stability and observability effectiveness
7. Continuous Improvement & Operational Excellence
• Identify patterns in incidents and performance issues
• Drive process improvements and operational maturity
• Improve MTTD, MTTR and overall system reliability
Reports To: Head Infrastructure
Coverage / Sub functions
• Technology Operations – Production Application Stability & Performance
• Incident & Problem Management
• Change & Release Management
• Observability & Monitoring Governance
• Operational Readiness & Business Continuity
Key Skills & Competencies
A) Technical Skills
• Hands-on understanding of AWS cloud services, including Kubernetes and containerized application platforms, and of distributed systems concepts (timeouts, retries, partial failures, and cascading impact)
• Operational understanding of storage and database services (including RDS, Aurora, DocumentDB, etc.)
• Strong understanding of application architectures, APIs, and microservices-based platforms
• Ability to trace end-to-end request flows across multiple services
• Ability to correlate logs, metrics, and traces to diagnose production issues
• Knowledge of observability tools (APM, ELK/OpenSearch, Prometheus, Grafana, Jaeger)
• Experience in incident, problem and change management (ITIL practices)
• Understanding of infrastructure and system dependencies
• Ability to analyze and troubleshoot cloud-specific failure patterns such as throttling, saturation, connectivity issues, and regional dependencies
B) Strategic Thinking and Problem-Solving
• Ability to analyze infrastructure challenges and propose reliable and scalable solutions.
• Ability to drive end-to-end issue resolution across multiple domains
• Strong analytical approach to incident trends and system behavior
• Capability to balance risk, stability and speed of delivery
• Decision-making in high-pressure production situations
• Drive to continuously improve monitoring, alerting and operational processes
C) Communication and Interpersonal Skills
• Strong ability to manage cross-functional teams and vendors
• Effective communication with business, leadership and technical stakeholders
• Ability to handle critical incident communication calmly and clearly
D) Governance and Compliance
• Proficiency in establishing IT governance frameworks and ensuring compliance.
• Ability to generate and present detailed reports for regulatory bodies
Qualifications: Education and Experience
• Bachelor's or Master's degree in Computer Science, Information Technology, Engineering, or equivalent.
• 12+ years of experience in Application Production Support & Technology Operations leadership with strong exposure to:
o AWS cloud services, Kubernetes, database services and API architecture
o Observability stack (APM, ELK, Prometheus, Grafana, Jaeger)
o Incident, Problem & Change management
o Production stability & release governance
o Improving MTTR, MTTD and Operational maturity
o Digital platforms, cloud-native architectures and regulatory environments
• Relevant certifications preferred:
o Cloud: AWS or Azure
o ITIL/ITSM frameworks
o Observability & DevOps
Location: Mumbai - Powai (work from office)
Job ID: 148661391