Job Description
We are dedicated to delivering high-quality managed services that ensure the reliability, scalability, and efficiency of our systems. We are seeking an experienced Site Reliability Engineer (SRE) Staff Software Engineer to join our Managed Services Team. This role is crucial for keeping our services resilient and able to scale seamlessly.
Responsibilities
- Work closely with development, operations, and product teams to ensure monitoring solutions align with business goals.
- Create and maintain scripts and automation tools to streamline monitoring and alerting processes.
- Produce and maintain clear documentation on monitoring setups, best practices, and troubleshooting procedures.
- Train team members and stakeholders on effective use and management of Datadog tools and features.
- Monitor the performance and availability of software systems, identify and resolve issues, and implement proactive measures to prevent future incidents.
- Design and maintain fault-tolerant architectures using redundancy, load balancing, and automated failover mechanisms to minimize downtime and ensure seamless service availability.
- Develop and implement automation strategies to reduce manual intervention and improve system reliability.
- Optimize system performance through proactive monitoring and tuning.
- Prepare and execute disaster recovery plans to ensure business continuity.
- Work closely with development and operations teams to bridge the gap between them, ensuring smooth deployment and operation of applications.
Incident Management
- Follow the incident management process, ensuring timely resolution and minimizing service disruptions.
- Conduct root cause analysis and implement preventive measures to reduce recurring incidents.
- Develop and maintain incident response procedures and communication protocols.
Change Management
- Manage the change management process, ensuring controlled and efficient implementation of changes.
- Assess the impact of proposed changes and mitigate potential risks.
- Ensure compliance with change management policies and procedures.
Metrics And Reporting
- Generate regular reports and dashboards to provide insights into service performance.
- Use data-driven insights to identify trends and drive continuous improvement.
Transformation And Automation
- Identify opportunities for process automation and implement solutions to improve efficiency.
- Evaluate and implement new monitoring tools.
Key Requirements
- Proven expertise in multiple monitoring tools.
- Minimum of 8 years of experience in monitoring and DevOps.
- Proficiency in scripting, coding, and software development principles.
- Strong understanding of IT operations and system management.
- Strong experience with automation tools and frameworks.
- Excellent troubleshooting and problem-solving skills.
- Effective communication skills to collaborate with cross-functional teams.
- Proven experience in incident management, change management, and problem management.
- Strong understanding of ITIL frameworks and best practices.
- Proven expertise in Datadog instrumentation and monitoring.
- Implement and manage Datadog instrumentation for infrastructure, APM, synthetic monitoring, database monitoring, and RUM.
- Experience with Cloud Platforms: Familiarity with cloud services such as AWS, Azure, or Google Cloud.
- Experience with tools like Docker and Kubernetes for managing containerized applications.
- Experience with monitoring and logging solutions such as Prometheus, Grafana, and the ELK stack (Elasticsearch, Logstash, Kibana).
- Proficiency in scripting and automation frameworks to streamline operations and improve system reliability.
- Understanding of security principles and practices to ensure the integrity and safety of systems.
- Collaborate with cross-functional teams to ensure comprehensive monitoring coverage.
- Develop and maintain Terraform scripts for Datadog configuration.
- Design and implement CI/CD pipelines for Datadog integrations.
- Provide expertise in other monitoring tools and concepts.
- Expertise in creating Datadog dashboards, monitors, and log pipelines.
Qualifications
Must-have Skills:
- Excellent analytical and troubleshooting skills to diagnose and resolve complex issues.
- Effective communication skills to collaborate with cross-functional teams and convey technical information clearly.
- Ability to thrive in a fast-paced environment, managing multiple tasks and projects simultaneously.
- Previous experience in a similar role or relevant industry experience is highly preferred.
- Knowledge of cloud platforms like AWS, Azure, or Google Cloud.
About Us
At Zensar, we're experience-led everything. We are committed to conceptualizing, designing, engineering, marketing, and managing digital solutions and experiences for over 130 leading enterprises. We are a company driven by a bold purpose: Together, we shape experiences for better futures. Whether for our clients, our people, or the world around us, this belief powers everything we do. At the heart of our culture is ONE with Client - a set of four core values that reflect who we are and how we work: One Zensar, Nurturing, Empowering, and Client Focus.
Part of the $4.8 billion RPG Group, we're a community of 10,000+ innovators across 30+ global locations, including Milpitas, Seattle, Princeton, Cape Town, London, Zurich, Singapore, and Mexico City. Explore Life at Zensar and join us to Grow. Own. Achieve. Learn. to be the best version of yourself.
We believe the best work happens when individuality is celebrated, growth is encouraged, and well-being is prioritized. We are an equal employment opportunity (EEO) and affirmative action employer, committed to creating an inclusive workplace. All qualified applicants will be considered without regard to race, creed, color, ancestry, religion, sex, national origin, citizenship, age, sexual orientation, gender identity, disability, marital status, family medical leave status, or protected veteran status.