Job Description
We are seeking a Lead Integration & Observability Specialist to design, implement, and lead enterprise observability and reliability solutions, while supporting cloud-based integration platforms on AWS/Azure. The role focuses on monitoring, automation, and operational readiness of applications, APIs, data pipelines, and messaging systems.
This is a hands-on technical leadership role with mentoring and solution ownership responsibilities.
Key Responsibilities
- Lead the implementation of enterprise observability for applications, APIs, services, batch jobs, and data pipelines.
- Design and standardize monitoring, alerting, logging, metrics, and health checks across distributed systems.
- Integrate observability platforms with incident management and automation tools to support proactive issue detection and remediation.
- Support the reliability and availability of integration platforms built on AWS/Azure.
- Perform advanced troubleshooting using logs, metrics, and traces to resolve production issues.
- Define operational readiness standards and non-functional requirements.
- Mentor engineers on observability best practices and platform usage.
- Collaborate with product, support, and operations teams to improve service stability and delivery.
Required Skills (Mandatory)
- 7+ years of overall IT experience
- 5+ years of relevant experience in Observability / Monitoring / Reliability Engineering
- Strong hands-on experience with enterprise observability tools such as IBM Instana, Dynatrace, AppDynamics, Prometheus, and Grafana
- Expertise in:
  - Monitoring and alerting design
  - Log management and analysis
  - Metrics and distributed tracing
  - Health checks and SLO/SLI concepts
- Experience monitoring AWS/Azure workloads
- Strong troubleshooting and incident analysis skills
- Experience defining operational and non-functional requirements
- Technical leadership and mentoring experience
- Automation and ITSM integration (ServiceNow workflows, incident automation)
- CI/CD and release management exposure
- Cloud integration and messaging exposure