Lead Operations Engineer
Experience: 8+ years
- Own operational oversight for services running on a Java-based microservices platform. Act as the primary escalation point for production incidents; lead incident response and communication.
- Drive post-incident reviews (blameless RCAs) and embed learnings through preventive actions.
- Maintain service dashboards, alerts, and incident tooling (e.g., PagerDuty, Datadog).
Technical Leadership
- Guide operational practices across services built with Java (Spring Boot), Kafka, MongoDB, and related technologies.
- Oversee monitoring, observability, and performance tuning using Datadog, ELK, Prometheus, or similar tooling.
Problem Management & Root Cause Elimination
- Lead proactive and reactive problem management efforts. Identify recurring production issues and collaborate with engineering to design permanent solutions.
- Track and reduce operational toil via automation and tooling improvements.
Change Enablement & Service Onboarding
- Partner with development teams to onboard new services in line with production readiness standards.
- Ensure all services meet requirements for monitoring, logging, documentation, support, and resilience before go-live.
- Support safe, rapid change practices, including canary releases, feature flags, and progressive delivery.
Team Management & Leadership
- Lead and mentor a team of operations engineers and/or SREs.
- Manage performance reviews, career development, and day-to-day team workload.
- Foster a high-performance culture with strong accountability, collaboration, and a learning mindset.
Continuous Improvement & DevOps Practices
- Drive automation and self-service initiatives to reduce manual intervention and operational burden.
- Champion observability best practices (metrics, traces, logs) and error budget tracking.
- Promote DevOps culture and continuous feedback loops between engineering and operations.
Governance, Risk & Compliance
- Ensure operational processes comply with security, privacy, and regulatory requirements (e.g., SOC 2, ISO 27001). Manage operational risks, service continuity plans, and audit readiness.