We're Hiring: DevOps Manager
Location: Nanakramguda, Hyderabad
Employment Type: Full-Time
Experience: 6+ Years
At Coschool, we're building AI-powered learning solutions that create real-world impact. If you enjoy solving complex infrastructure problems and want to scale systems that support Generative AI, this is your opportunity.
- Make an Impact: Join a fast-growing startup with a market-ready product and help scale our platform from the ground up.
- Work on Cutting-Edge Tech: Be at the forefront of LLMOps, MLOps, and AI-driven systems.
- Grow with the Company: Thrive in a high-ownership environment that values learning, adaptability, and leadership.
- Innovative Culture: Collaborate with a team that encourages experimentation and bold problem-solving.
What You'll Do
- Own production stability, uptime, and reliability across applications and infrastructure
- Lead incident management, on-call rotations, and post-incident reviews
- Design and maintain CI/CD pipelines using Jenkins and GitHub Actions
- Manage and optimize AWS infrastructure (EC2, EKS, ALB/NLB, Lambda, API Gateway, Cognito, SNS/SES, ElastiCache)
- Build and operate containerized platforms using Docker and Kubernetes (EKS)
- Define and monitor SLIs/SLOs aligned with business outcomes
- Implement observability using Prometheus, Grafana, and ELK
- Automate infrastructure using Terraform, Pulumi, and Ansible
- Manage production data stores and streaming systems (MongoDB, PostgreSQL, MySQL, Redis, Kafka, ClickHouse, Elasticsearch, Milvus)
- Ensure high availability (HA), disaster recovery (DR), security, and cost efficiency across systems
- Reduce operational toil through Python-based automation
- Maintain runbooks, documentation, and operational playbooks
- Mentor DevOps engineers and foster a culture of ownership and accountability
What We're Looking For
- 6+ years of experience in DevOps, SRE, or Infrastructure Engineering
- Strong leadership experience guiding DevOps teams and driving delivery
- Hands-on expertise with AWS/Azure, Jenkins, and GitHub Actions
- Proficiency in Docker, Kubernetes, Terraform, Ansible, and CloudFormation
- Solid understanding of databases, caching, and messaging systems
- Experience with monitoring and observability tools (Prometheus, Grafana, ELK)
- Strong Python scripting and automation skills
- Deep knowledge of cloud security best practices
- Excellent problem-solving and cross-team collaboration skills
Nice to Have
- Experience with Hadoop, Spark, or Flink
- Exposure to MLOps or LLMOps platforms