We are seeking an experienced SRE Observability Engineer to join a global Monitoring and Observability team responsible for building, scaling, and maintaining enterprise-grade observability solutions. This role focuses on modernizing monitoring platforms, driving end-to-end observability strategy, and enabling data-driven decision-making across large-scale distributed systems.
The ideal candidate brings deep expertise in cloud-native technologies, observability tools, and automation, along with strong collaboration and communication skills to influence technical and business stakeholders.
Responsibilities
- Operate within a globally distributed environment, supporting large-scale systems.
- Collaborate with cross-functional teams to design and implement observability solutions for enterprise-wide adoption.
- Manage and enhance legacy monitoring platforms while contributing to modernization initiatives.
- Drive the implementation of end-to-end observability solutions across metrics, logs, and traces.
- Analyze complex system behaviors and provide insights to solve performance and reliability challenges.
- Influence strategic decisions by providing technical guidance and recommendations.
- Communicate effectively with stakeholders and promote best practices in observability and SRE.
- Contribute to documentation, knowledge sharing, and continuous improvement initiatives.
- Perform additional duties as required to support operational excellence.
Requirements
- Experience: 6–15 years in Site Reliability Engineering, DevOps, or Observability Engineering.
- Strong experience with OpenShift/Kubernetes administration, including deployment, troubleshooting, resource management, and networking.
- Hands-on expertise with Grafana and observability ecosystems, including:
  - Grafana administration (dashboards, alerts, data sources, user management)
  - Prometheus and PromQL
  - Working knowledge of backend components such as Mimir (metrics), Loki (logs), and Tempo (traces)
- Experience with enterprise monitoring tools such as ITRS Geneos or similar.
- Experience with Helm charts for application deployment and management (including dependencies and customization).
- Strong scripting and automation skills using Bash or Python.
- Ability to create clear, concise, and well-structured technical documentation.
- Excellent analytical, problem-solving, and communication skills.
Nice to have
- Experience with application deployment platforms such as Lightspeed Enterprise (or similar).
- Exposure to Google Cloud Platform (GCP) operations and services.
- Familiarity with modern cloud-native observability frameworks and practices.
- Experience in large-scale enterprise environments with distributed systems.
We offer
- Opportunity to work on bleeding-edge projects
- Work with a highly motivated and dedicated team
- Competitive salary
- Flexible schedule
- Benefits package: medical insurance, sports
- Corporate social events
- Professional development opportunities
- Well-equipped office
About Us
Grid Dynamics (NASDAQ: GDYN) is a leading provider of technology consulting, platform and product engineering, AI, and advanced analytics services. Fusing technical vision with business acumen, we solve the most pressing technical challenges and enable positive business outcomes for enterprise companies undergoing business transformation. A key differentiator for Grid Dynamics is our 8 years of experience and leadership in enterprise AI, supported by profound expertise and ongoing investment in data, analytics, cloud & DevOps, application modernization and customer experience. Founded in 2006, Grid Dynamics is headquartered in Silicon Valley with offices across the Americas, Europe, and India.