AIOps Architecture, IT Service Management (ITSM), Observability & Monitoring, Machine Learning, Site Reliability Engineering (SRE)
Description
GSPANN is hiring an AIOps Architect to design and lead enterprise AIOps architecture across observability, incident management, automation, and autonomous remediation. The role focuses on integrating Machine Learning, IT Service Management (ITSM) platforms, and Site Reliability Engineering (SRE) practices to build scalable, self-healing operational ecosystems.
Location: Hyderabad
Role Type: Full Time
Published On: 3 March 2026
Experience: 10+ Years
Role and Responsibilities
- Design and implement end-to-end AIOps architecture covering observability, incident lifecycle management, anomaly detection, root cause analysis (RCA), and autonomous remediation.
- Define and maintain AIOps strategy, reference architecture, governance frameworks, and operational blueprints.
- Architect agentic and generative AI patterns for IT Operations, Data Operations, and Platform Operations.
- Design unified observability frameworks spanning logs, metrics, traces, alerts, and events.
- Build scalable event ingestion, correlation, and anomaly detection pipelines using ML and AI models.
- Develop confidence-scored, controlled auto-remediation and auto-correction workflows.
- Create orchestration layers to enable self-healing infrastructure, applications, and data pipelines.
- Integrate runbooks, standard operating procedures (SOPs), and AI-driven virtual agents to automate L1 and L2 operations.
- Integrate AIOps platforms with ITSM tools (ServiceNow, Jira), Configuration Management Database (CMDB), asset inventory systems, and enterprise observability stacks.
- Collaborate with Data Engineering, Cloud, Infrastructure, Security, and Application teams to operationalize AIOps capabilities.
- Align AIOps solutions with SRE principles, ITSM processes, and enterprise reliability objectives.
- Serve as a trusted technical advisor to leadership, contributing to roadmaps, QBRs, and transformation programs.
- Mentor engineering and operations teams on AIOps architecture, automation, and observability best practices.
- Drive adoption through documentation, playbooks, governance models, and enablement initiatives.
Skills and Experience
- 10+ years of experience in IT Operations, SRE, DevOps, or related domains, including 3-5 years architecting AIOps solutions.
- Hold certifications in Cloud (AWS/Azure/GCP), SRE, ITIL, Observability platforms, or related disciplines (preferred).
- Demonstrate strong hands-on experience with AIOps platforms, observability stacks, and monitoring ecosystems such as Prometheus, Grafana, Elastic, Dynatrace, or New Relic.
- Possess experience operationalizing ML models through MLOps practices.
- Apply deep understanding of logs, metrics, traces, event pipelines, and distributed system architectures.
- Design and implement Machine Learning, Generative AI, and agent-based architectures for operational automation.
- Build anomaly detection models, predictive alerting systems, and advanced RCA frameworks.
- Develop event correlation engines and automated remediation workflows.
- Possess strong expertise in infrastructure (compute, storage, networking), cloud platforms (AWS, Azure, GCP), and Kubernetes ecosystems.
- Apply DevOps, CI/CD, SRE, and automation frameworks in enterprise environments.
- Integrate AIOps capabilities with ITSM and CMDB platforms using enterprise integration patterns.
- Design scalable, resilient, modular, and secure AIOps architectures.
- Create reference architectures, governance models, and operational blueprints for enterprise adoption.
- Apply data engineering principles, including ETL/ELT pipelines and data quality frameworks.
- Lead engineering transformation initiatives and drive operational excellence across complex environments.
- Demonstrate strong analytical thinking, problem-solving, and stakeholder management skills.