Bima Sugam India Federation

Information Technology Operations Lead

12-14 Years

Job Description

The Technology Operations Lead owns and drives production stability, observability, and operational governance of digital platforms and applications, ensuring they function seamlessly in production environments.

Title :

Technology Operations Lead

Position Objectives

The role acts as the single point of ownership for production operations and is responsible for incident and problem management, change and release governance, and observability effectiveness, while ensuring alignment with business SLAs, regulatory requirements, and enterprise standards.

Indicative Responsibilities

1. Application Production Support

• Own availability, reliability, and performance of production business applications and digital platforms

• Own operational acceptance of applications before they go live; ensure readiness across the support model (L1/L2/L3), documentation and runbooks, capacity and performance baselines, and DR and backup readiness

• Obtain sign-off from the application owner that applications are fit for production support

• Ensure adherence to SLA, uptime and performance benchmarks

• Maintain end-to-end visibility across application and infrastructure layers

• Govern capacity planning, especially for peak loads and business events

• Participate in DR drills and failover testing

• Support vulnerability remediation prioritization

2. Incident Management (Command & Control)

• Act as Incident Commander for P1 incidents: drive war rooms, triage, and cross-team coordination (App & Infra)

• Ensure rapid restoration of services and structured internal stakeholder communication across teams

• Track and reduce incident frequency and impact

• Ensure incidents are logged, tracked, categorized, and closed as per ITSM processes

3. Problem Management & RCA Governance

• Validate the quality, depth, and accuracy of RCAs provided by internal teams and vendors/partners.

• Ensure permanent fixes and prevention of recurring issues

• Maintain and track problem backlog and corrective actions

4. Change & Release Governance

• Participate in change and release governance from a production stability perspective

• Review production readiness for releases, including rollback and recovery plans, monitoring and alerting readiness, support runbooks, and escalation models

• Approve/reject changes based on change process completeness

• Ensure a controlled and stable release cycle

5. Observability & Monitoring Governance

• Govern Application Performance Monitoring (APM) and metrics; maintain visibility across application and infrastructure dependencies

• Contribute to enhancing infrastructure monitoring frameworks.

• Improve alert quality, reduce noise and ensure actionable monitoring

• Enable proactive detection of issues

6. Vendor Management & Governance

• Manage vendor partners for production operations

• Ensure adherence to SLA, response timelines and quality standards

• Prevent blame shifting and enforce clear ownership & accountability

• Drive performance reviews and escalation management

• Seek monthly and quarterly operations health reports

• Own and validate production operations dashboards shared by partners/vendors, covering availability, incidents, business journeys, change stability, and observability effectiveness

8. Continuous Improvement & Operational Excellence

• Identify patterns in incidents and performance issues

• Drive process improvements and operational maturity

• Improve MTTD, MTTR and overall system reliability

Reports To Head Infrastructure

Coverage / Sub functions

• Technology Operations – Production Application Stability & Performance

• Incident & Problem Management

• Change & Release Management

• Observability & Monitoring Governance

• Operational Readiness & Business Continuity

Key Skills & Competencies

A) Technical Skills

• Hands-on understanding of AWS cloud services, including Kubernetes, containerized application platforms, and distributed systems concepts (timeouts, retries, partial failures, and cascading impact)

• Operational understanding of Storage & Database services (including RDS, Aurora, DocumentDB, etc.)

• Strong understanding of application architectures, APIs, and microservices-based platforms

• Ability to trace end-to-end request flows across multiple services

• Ability to correlate logs, metrics, and traces to diagnose production issues

• Knowledge of observability tools (APM, ELK/OpenSearch, Prometheus, Grafana, Jaeger)

• Experience in incident, problem and change management (ITIL practices)

• Understanding of infrastructure and system dependencies

• Ability to analyze and troubleshoot cloud-specific failure patterns such as throttling, saturation, connectivity issues, and regional dependencies

B) Strategic Thinking and Problem-Solving

• Ability to analyze infrastructure challenges and propose reliable and scalable solutions.

• Ability to drive end-to-end issue resolution across multiple domains

• Strong analytical approach to incident trends and system behavior

• Capability to balance risk, stability and speed of delivery

• Decision-making in high-pressure production situations

• Continuously improve monitoring, alerting and operational processes

C) Communication and Interpersonal Skills

• Strong ability to manage cross-functional teams and vendors

• Effective communication with business, leadership and technical stakeholders

• Ability to handle critical incident communication calmly and clearly

D) Governance and Compliance

• Proficiency in establishing IT governance frameworks and ensuring compliance.

• Ability to generate and present detailed reports for regulatory bodies

Qualifications Education and Experience

• Bachelor's or master's degree in Computer Science, Information Technology, Engineering, or equivalent.

• 12+ years of experience in Application Production Support & Technology Operations leadership with strong exposure to:

o AWS cloud services, Kubernetes, Database services & API architecture understanding

o Observability stack (APM, ELK, Prometheus, Grafana, Jaeger)

o Incident, Problem & Change management

o Production stability & release governance

o Improving MTTR, MTTD and Operational maturity

o Strong experience in digital platforms, cloud-native architectures and regulatory environments.

• Relevant certifications preferred:

o Cloud: AWS or Azure

o ITIL/ITSM frameworks

o Observability & DevOps

Location Mumbai - Powai (work from office)

Job ID: 148661391