HireAlpha

SRE / DevOps Engineer (Prometheus/Grafana to Datadog Migration)


Job Description

Role: SRE / DevOps Engineer (Prometheus/Grafana to Datadog Migration)

Location: Bangalore (Work From Office)

Experience Required: 5+ Years

Employment Type: Contractual

We estimate an initial period of 3 months, which can be extended based on performance and project conditions.

Interview Process:

Technical Screening + Technical Assessment

Experience Required:

Must Have:

- At least 5 years of relevant experience working with the observability stack named above.

- Has managed and operated the Datadog platform.

- Strong communication skills to interact with global teams.

- Fundamental knowledge of working and operating on AWS using IaC (Infrastructure as Code) practices.

1. Migration projects from our current technologies to Datadog (please focus on the number of dashboards and alerts migrated, which should be in the thousands, and note the high level of interaction with engineering teams; the migration should be automated, not done manually).

2. Expertise or high-level familiarity with Terraform and Ansible, specifically the ability to install, roll out, and troubleshoot Datadog agents, and to understand, read, and debug issues related to Datadog and/or the migration process.

3. Must have very good or exceptional communication skills. This person will need to interact with engineers and debate possible solutions with them.

4. Must have very good knowledge of Datadog and of migrations to Datadog from Prometheus/Grafana specifically. Ideally, several migrations will have been completed using automation, not manually.

5. Knowledge of Ansible and Terraform is desirable.

Beginning in March, we need to start a new project to migrate our observability infrastructure stack from a self-hosted setup on AWS (Prometheus/Grafana, Loki, Mimir) to the Datadog SaaS solution.

We are looking for strong engineers who will focus on the engineering deliverables set by the organization's SRE team for the migration.

SKILLS:

1. Working Knowledge of Prometheus and PromQL

- Ability to read, understand, and modify existing PromQL queries, dashboards, and alerting rules, including common aggregations and label usage.

2. Grafana and Alertmanager Familiarity

- Experience navigating Grafana dashboards and Alertmanager configurations to understand intent, thresholds, and alert routing.

3. Datadog Dashboarding and Monitors

- Hands-on experience creating Datadog dashboards and monitors based on defined requirements, using existing patterns and guidance.

4. Query and Alert Semantics Translation

- Ability to accurately map PromQL queries and Alertmanager rules to Datadog equivalents, recognising non-1:1 translations, validating statistical correctness, and documenting functional differences where exact parity is not possible.

5. Observability Concepts

- Understanding of metrics vs logs vs traces, alert thresholds, and standard monitoring practices in production environments.

6. Team Collaboration

- Ability to work with engineering teams to validate migrated dashboards and alerts, following structured validation checklists.

7. Clear Execution and Documentation

- Documenting migrated assets, assumptions, and validation outcomes in a consistent, predefined format.

8. Automation Skills

- Proficient in building tooling in Python to reduce engineering toil for these migration activities.

Nice to Have:

- AWS Administrator Certifications.

Job ID: 144180599