Observability & Monitoring Lead

  • Posted 14 days ago

Job Description

Project Description:

Support clients in the operation, maintenance, and optimization of Oracle Cerner EHR environments. This role is designed for early-career professionals who are eager to grow their technical skills in healthcare IT while working under the mentorship of experienced consultants and technical leaders. You will gain hands-on exposure to Cerner infrastructure, system workflows, and healthcare technology best practices while contributing to meaningful client outcomes.

Responsibilities:

1. Trend Analysis & Problem Identification

- Identify recurring incident patterns, anomalies, and signs of alert fatigue that may indicate deeper systemic issues.

- Collaborate with L2/L3 teams to review telemetry data and recommend improvements to alert thresholds, rules, and policies.

- Provide insights that support proactive issue prevention, noise reduction, and overall monitoring refinement.

2. Platform Management & Optimization

- Develop, update, and maintain dashboards that reflect real-time system health, performance metrics, and service behavior.

- Support the ongoing adoption and optimization of Dynatrace, enhancing dashboarding and visualization capabilities for cloud and on-prem observability.

- Assist in routine platform checks, ensuring monitoring tools remain accurate, stable, and aligned with business and operational requirements.

3. Leadership & Collaboration

- Organize the team's work, including planning, task breakdown, and ensuring clarity of priorities.

- Provide structured, timely updates to leadership on progress, risks, blockers, team capacity, and delivery timelines.

- Work closely with application teams, SRE groups, and infrastructure operations during incident triage, investigations, and routine monitoring reviews.

- Ensure clear, timely, and effective communication with stakeholders during service-impacting events, providing status updates and context as needed.

- Ensure adherence to engineering best practices, drive operational excellence, and maintain accountability for team delivery outcomes.

4. Operational Excellence

- Support platform stability and availability through adherence to lifecycle maintenance, patching schedules, and vulnerability management processes.

- Contribute to the improvement of monitoring workflows, alert routing logic, runbook effectiveness, and incident management practices.

5. Innovation & AI Enablement

- Assist in exploring and adopting AI-driven capabilities that improve observability, automate root-cause identification, and reduce manual effort.

- Contribute to internal knowledge sharing by documenting best practices, playbooks, AI reference materials, and usage guidelines (e.g., Copilot tips).

6. Collaboration & Leadership Support

- Partner with cross-functional teams to align monitoring practices with evolving business needs and operational priorities.

- Drive end-to-end delivery of monitoring initiatives: requirements gathering, planning, execution oversight, and delivery validation.

- Coordinate cross-team dependencies, ensure timelines are met, and proactively remove blockers for the team.

- Provide subject-matter support for ITSM processes including incident, problem, and change management discussions.

Mandatory Skills:

New Relic

Mandatory Skills Description:

- 6+ years in Site Reliability Engineering or Observability/Monitoring engineering roles.

- 5+ years hands-on with monitoring/observability tools: New Relic, SolarWinds, WUG (WhatsUp Gold).

- 4+ years of scripting experience (JavaScript, Java, PowerShell, or others)

- 2+ years with Azure (architecture fundamentals, observability in cloud-native and lift-and-shift contexts).

- 4+ years of scripting with Python and Bash or PowerShell for automation.

- Experience troubleshooting complex distributed applications, leading/participating in war rooms, and performing code-level impact analysis (read logs/stack traces, correlate with deploys and infra changes).

- Solid understanding of observability best practices (metrics, logs, traces), ITSM processes, and alert hygiene.

- Have an automation-first mindset: look to automate any task.

- Maintain associated documentation as it applies to our audit and certification requirements

- Ensure platform stability, availability, and compliance through proactive vulnerability management and lifecycle maintenance

- Drive process improvements for monitoring workflows and incident management

- Participate in troubleshooting, capacity planning, and performance analysis activities

- Research new monitoring requirements and, in many cases, write code to meet them.

- Solid expertise in setting up monitoring policies, rules, and templates, and in writing scripts to meet monitoring requirements.

- Excellent problem-solving, communication, and cross-team collaboration skills.

Job ID: 148222619