Dexian India

AI/ML Observability Engineer

  • Posted 4 hours ago

Job Description

Overview

We are seeking a passionate, hands-on AI/ML Engineer to accelerate our Enterprise Observability strategy. This role will design, build, and operationalize AI/ML capabilities that enhance end-to-end telemetry pipelines, anomaly detection, intelligent alerting, and proactive system resiliency.

You will work at the intersection of AI/ML engineering, Observability platforms, and automation, developing solutions that improve detection, diagnosis, and prevention of operational issues across distributed systems.

________________________________________

Key Responsibilities

• Design and deploy AI/ML models supporting anomaly detection, baselining, event correlation, and predictive operational analytics.

• Build and integrate AI‑enabled capabilities into enterprise Observability platforms, including Grafana, APM/RUM tools, network telemetry systems, and data observability tools.

• Develop AI Agents that can autonomously triage issues, recommend corrective actions, and initiate automated remediation workflows to reduce recovery time and improve system resilience.

• Implement self‑healing automation using AI‑driven decisioning, integrating with orchestration frameworks, service APIs, and infrastructure automation pipelines.

• Engineer and maintain real‑time and batch data pipelines using Snowflake ML Jobs, Snowflake Cortex, streams, tasks, and UDFs.

• Implement and manage OpenTelemetry‑based telemetry ingestion for logs, metrics, traces, and spans across distributed systems.

• Build asynchronous Python APIs and services for model inferencing and operational integration.

• Enhance observability intelligence with AI-powered capabilities such as root‑cause acceleration, chatbot/search enablement, and automated insights.

• Contribute to SLO/SLI modeling, Golden Signals instrumentation, and Observability NFR adoption.

• Collaborate across engineering, SRE, platform, and business teams to embed proactive intelligence and Observability standards throughout the ecosystem.

Required Skills & Qualifications

Core Technical Skills

• Strong proficiency in Python and data science/ML libraries: NumPy, Pandas, scikit-learn, TensorFlow, PyTorch, Matplotlib, Seaborn.

• Experience with Generative AI, LLM fine-tuning, prompt engineering, RAG pipelines, and LLM evaluation frameworks.

• Expertise in developing and deploying ML models in production (batch & streaming).

• Strong understanding of statistics, time-series modeling, and anomaly detection.

Observability & Telemetry

• Experience with OpenTelemetry for logs, metrics, traces, and spans.

• Familiarity with Observability concepts: Golden Signals, SLO/SLI design, APM, RUM, Synthetics, event correlation, baselining.

• Experience with Observability tools such as Grafana (Alloy agents, dashboards, ML capabilities), Dynatrace, Monte Carlo (Data Observability), Netscout, ThousandEyes, SolarWinds, NetBrain.

Cloud, Data & Platform

• Hands-on experience with AWS (SageMaker, Bedrock), Snowflake ML, Snowflake Openflow, and Snowflake AI Observability tooling.

• Experience building Snowflake data pipelines (streams, tasks, UDFs); experience with Cortex features is a plus.

• Strong understanding of distributed systems and microservices telemetry requirements.

Automation & Engineering Quality

• Experience with automation pipelines, CI/CD, and infrastructure-as-code patterns supporting Observability adoption.

• Ability to build asynchronous Python APIs or services for model inference and operational integration.

________________________________________

Preferred Qualifications

• Experience developing agentic AI systems that analyze telemetry, generate action recommendations, or execute automated operational responses.

• Experience building self‑healing patterns, including automated rollback, service restarts, configuration corrections, and predictive maintenance.

• Experience in Snowflake ML workflows, Snowflake Cortex Agents, and data pipeline automation.

• Exposure to AI-enabled alerting, RCA automation, and operational self‑healing concepts.

• Experience with large-scale operational telemetry and multi-cloud ecosystems.

Soft Skills

• Strong analytical thinking and problem-solving.

• Excellent communication skills for cross-functional collaboration with infrastructure, SRE, engineering, business, and leadership teams.

• Curiosity, continuous learning mindset, and passion for applied AI and Observability.


Job ID: 147128805