Job Title: Data Analytics
Location: India (Offshore)
Employment Type: Full-time
As an SRE in the Analytics domain, you will bridge the gap between data engineering, observability, and reliability operations, helping teams make data-driven decisions that improve availability, performance, and operational excellence.
Key Responsibilities
- Design, build, and maintain data pipelines and analytics platforms that deliver actionable insights into system performance, availability, and reliability.
- Develop dashboards and visualizations (Grafana, Tableau, Looker, etc.) to monitor key SRE metrics such as SLOs, SLIs, error budgets, and capacity trends.
- Partner with SREs, DevOps, and product teams to analyze incident data, automate reporting, and identify recurring reliability issues.
- Support data ingestion and aggregation from multiple sources (CloudWatch, Prometheus, ELK, Datadog, Splunk, etc.) into a unified analytics layer.
- Build and maintain automation scripts and data models for performance and reliability insights.
- Drive post-incident analysis and trend reporting to inform long-term reliability improvements.
- Collaborate with business stakeholders to translate operational data into measurable reliability KPIs.
- Ensure data quality, governance, and availability for all reliability-related analytics systems.
Qualifications
Required Skills:
- Bachelor's or master's degree in computer science or a related field.
- 3+ years of experience as an SRE, DevOps Engineer, or Data Engineer in large-scale, cloud-native environments.
- Strong experience with AWS, GCP, or Azure platforms.
- Hands-on experience with data analytics tools (e.g., Python, SQL, Pandas, PySpark).
- Familiarity with monitoring and observability stacks (e.g., Prometheus, Grafana, ELK, Datadog).
- Strong understanding of SRE concepts, including SLIs, SLOs, error budgets, and incident management.
- Excellent problem-solving, analytical, and communication skills.
Preferred Skills:
- Experience with data visualization tools (Grafana, Power BI, Tableau, Looker).
- Familiarity with machine learning for anomaly detection in reliability data.
- Experience automating reporting and dashboards using AWS Glue, Athena, or Lambda.
- Knowledge of CI/CD pipelines and Infrastructure as Code (Terraform, CloudFormation).
- Familiarity with ITSM/CMDB systems and integrating reliability analytics into operational workflows.