Location: Vadodara, Gujarat (Onsite)
Experience: 2-4 Years
Employment Type: Full-time
Role Overview
Air Render Farm is building a high-performance GPU compute platform for distributed rendering and large-scale compute workloads. We are looking for a DevOps Engineer who can work across multiple cloud environments and automate infrastructure at scale.
The ideal candidate is comfortable working with AWS, Google Cloud (GCP), and Microsoft Azure, and can adapt to new infrastructure tools such as SkyPilot for multi-cloud compute orchestration.
This role involves building scalable infrastructure, automating deployments, managing GPU workloads, and improving platform reliability.
Key Responsibilities
Multi-Cloud Infrastructure
- Deploy and manage infrastructure across AWS, GCP, and Azure.
- Build cloud-agnostic infrastructure workflows to avoid vendor lock-in.
- Configure and manage compute instances, networking, storage, and security policies.
Cluster & Compute Management
- Provision and manage compute clusters for large-scale workloads.
- Work with distributed compute orchestration tools such as SkyPilot.
- Optimize infrastructure for high-performance workloads such as rendering and batch compute jobs.
Infrastructure Automation
- Implement Infrastructure as Code (IaC) using tools such as Terraform.
- Automate provisioning, scaling, and management of compute resources.
CI/CD & Deployment
- Build and maintain CI/CD pipelines for backend services and infrastructure.
- Automate build, testing, and deployment processes.
Monitoring & Reliability
- Implement monitoring, logging, and alerting systems.
- Troubleshoot infrastructure and deployment issues.
- Improve system reliability and performance.
Cost Optimization
- Monitor cloud usage and optimize infrastructure costs.
- Implement strategies such as autoscaling, spot instances, and efficient workload scheduling.
Must-Have Skills
- Strong experience with Windows, Linux, and macOS, including command-line operations.
- Hands-on experience with at least one major cloud platform (AWS, GCP, or Azure) and the ability to adapt to others.
- Experience with Infrastructure as Code tools such as Terraform.
- Solid understanding of Docker and containerization.
- Basic working knowledge of Kubernetes or other container orchestration systems.
- Proficiency in Bash or Python scripting for automation.
- Understanding of cloud networking fundamentals (VPCs, subnets, firewalls, load balancers).
- Strong troubleshooting skills for distributed infrastructure environments.
What We Look For
- Strong problem-solving and debugging skills.
- Ability to learn new infrastructure tools quickly.
- Comfort working in a fast-paced startup environment.
- Interest in high-performance computing, GPU infrastructure, and distributed systems.