Experience: 3+ years
Salary: Confidential (based on experience)
Expected Notice Period: 7 Days
Shift: (GMT+05:30) Asia/Kolkata (IST)
Opportunity Type: Remote
Placement Type: Full Time Indefinite Contract (40 hrs a week/160 hrs a month)
(*Note: This is a requirement for one of Uplers' clients - Strategic Transformation Through Digital & Physical Innovation)
What do you need for this opportunity
Must have skills required:
Grafana, Kubernetes tools, Monitoring tools, Prometheus, Scripting, CI/CD, CockroachDB, Terraform, AWS, Docker, GCP, GitHub, Kubernetes
Strategic Transformation Through Digital & Physical Innovation is looking for:
DevOps Automation Engineer
Dual-Cloud Infrastructure - AWS, Google Cloud, CockroachDB & Tailscale
About The Role
This role spans:
- Dual-cloud infrastructure (AWS + GCP)
- Developer workstation management
- Security automation
- Incident response
- CI/CD pipeline operations
The platform is scaling to serve millions of consumers across 20,000+ veterinary clinics in 100+ countries, with tens of millions of projected concurrent sessions.
You must have real-world experience operating infrastructure at this scale, including:
- Massive server loads
- Database replication and failover at scale
- Disaster recovery
- Performance optimization under heavy concurrent traffic
The current production stack runs on ECS Fargate with RDS PostgreSQL, with CockroachDB planned for distributed workloads.
You will manage:
- Tailscale VPN infrastructure
- 20+ EC2 dev boxes
- Cloudflare for 23 zones
- PagerDuty alerting
- A robust security automation layer
This is a hands-on role with real operational ownership.
What You'll Do
Cloud Infrastructure (AWS + GCP)
- Design, build, and manage dual-cloud infrastructure across AWS and Google Cloud Platform
- Manage ECS Fargate deployments — task definitions, service discovery, ALB target groups, and blue/green deployments
- Automate infrastructure provisioning using Terraform with modular, reusable configurations
- Build and maintain CI/CD pipelines using GitLab CI and GitHub Actions
- Manage containerized applications using Docker, ECS, and Kubernetes (EKS/GKE for planned workloads)
- Support multi-tenant and multi-region application architectures across 6+ global regions
- Implement and maintain CockroachDB clusters for distributed, geo-partitioned data (planned migration from RDS PostgreSQL)
- Implement infrastructure cost optimization through:
  - Auto-scaling
  - Reserved capacity
  - Right-sizing
  - Spot instances
  - Savings Plans
- Continuously monitor and reduce cloud spend across AWS and GCP
- Optimize database costs through:
  - Right-sizing instances
  - Storage tiering
  - Reserved capacity
  - Query performance tuning
Developer Workstation Infrastructure
- Provision and manage 20+ EC2 dev boxes across 3 AWS regions
- Build custom AMIs using Packer for dev boxes and DERP relays
- Deploy and maintain:
  - Memory watchdog
  - noVNC
  - CloudWatch agent configurations
- Run fleet management commands across dev boxes via AWS Systems Manager (SSM)
- Monitor dev box health and performance
Tailscale VPN Administration
- Manage Tailscale ACL policies and user access
- Operate custom DERP relays in 3 regions
- Configure app connectors for SaaS IP lockdown
- Maintain Mullvad VPN integration for egress control
Security Automation
- Own GuardDuty, Security Hub, and AWS Config across all regions
- Manage EventBridge rules for security alert routing
- Build and manage:
  - IAM policies
  - Secrets management
  - WAF
  - Zero-trust networking
- Administer GitHub Enterprise security, including:
  - Org management
  - IP allowlists
  - Secret scanning policies
  - Runner management
Scale, Performance & Disaster Recovery
- Design and operate infrastructure capable of handling millions of concurrent users and tens of millions of sessions across global regions
- Implement and manage auto-scaling policies, including:
  - ECS service auto-scaling
  - EC2 ASGs
  - RDS read replicas
- Conduct load testing and capacity planning
- Design and maintain database scaling strategies:
  - Read replicas
  - Connection pooling
  - Query optimization
  - Sharding for high-throughput workloads
- Own disaster recovery (DR) planning and execution:
  - Multi-region failover
  - RTO/RPO targets
  - Automated recovery runbooks
  - Regular DR drills
- Implement and manage database backup strategies:
  - Point-in-time recovery
  - Cross-region replication
  - Automated restore testing
- Optimize CDN and edge caching (Cloudflare) for global traffic at scale
- Monitor and resolve performance bottlenecks across:
  - Application servers
  - Databases
  - Caches
  - Network layers
- Build runbooks for incident response during:
  - High-traffic events
  - Database failovers
  - Regional outages
Monitoring, Alerting & Incident Response
- Configure and maintain PagerDuty
- Monitor system performance using:
  - Prometheus
  - Grafana
  - CloudWatch
  - Cloud Monitoring
- Manage EBS backup automation, including:
  - Daily backups
  - 30-day retention
  - Cross-region copy
  - Vault lock
CI/CD & Repository Operations
- Manage GitLab mirroring from GitHub
- Maintain 45+ cron jobs on the admin box
- Manage Cloudflare across 23 zones, including:
  - CDN
  - DNS
  - WAF configuration
- Collaborate with developers to improve deployment workflows and reduce lead time
AI/ML Infrastructure & Tooling
- Use Claude Code / Cursor for:
  - Terraform authoring
  - Script generation
  - Infrastructure debugging
- Support AI/ML infrastructure, including:
  - GPU instance management
  - Model deployment pipelines
- Maintain and improve AI-assisted monitoring and alerting
- Support infrastructure requirements for AI-enabled platform capabilities
Must-Have Skills
- 3–5+ years of experience in DevOps / Cloud / Infrastructure Automation at scale
- High-Scale Production Experience (Critical) — must have operated infrastructure serving millions of users with high concurrency
- Experience with:
  - Server load management
  - Database scaling (read replicas, connection pooling, sharding)
  - Auto-scaling policies
  - Performance optimization under heavy traffic
- Strong hands-on experience with AWS, including:
  - ECS Fargate
  - EKS
  - Lambda
  - S3
  - RDS
  - CloudFront
  - SQS
  - IAM
  - SSM
  - GuardDuty
  - Security Hub
  - Config
- Working experience with Google Cloud Platform, including:
  - GKE
  - Cloud Run
  - BigQuery
  - Cloud Functions
  - IAM
- ECS Fargate production experience
- Terraform (Infrastructure as Code) with multi-environment, modular patterns
- Tailscale VPN administration — ACLs, DERP relays, app connectors
- Packer for AMI builds
- Docker and container orchestration in production
- Experience with GitLab CI/CD and/or GitHub Actions
- GitHub Enterprise administration
- Cloudflare CDN/DNS/WAF management
- PagerDuty or equivalent incident response configuration
- Production experience with CockroachDB or distributed SQL databases (or strong willingness to learn)
- Disaster recovery planning and execution, including:
  - Multi-region failover
  - Backup automation
  - RTO/RPO targets
  - Recovery runbooks
- Database performance optimization at scale, including:
  - Replication
  - Connection pooling
  - Query tuning
  - Capacity planning
- Cost Optimization (Critical), with a proven track record of reducing cloud infrastructure costs through:
  - Right-sizing
  - Reserved capacity
  - Spot instances
  - Storage tiering
  - Waste reduction
- Good understanding of Linux systems, networking, and security fundamentals
- Strong communication skills and ability to work in a remote, globally distributed team
Nice-to-Have Skills
- Experience with Kubernetes tools:
  - Helm
  - ArgoCD
  - Flux
- Experience with monitoring stacks:
  - Prometheus
  - Grafana
  - ELK
  - Loki
- AWS Systems Manager fleet management at scale
- Experience working in startup or fast-paced product environments
- Scripting experience:
  - Bash
  - Python
  - Go
- Experience supporting AI/ML workloads and GPU infrastructure
- Experience with chaos engineering tools (Gremlin, Litmus) for resilience testing
- FinOps certification or formal cloud cost management framework experience
What We're Looking For (Mindset)
- Strong ownership and problem-solving mindset
- Comfort working in a fast-growing, evolving environment
- Ability to balance speed with stability and security
- Willingness to learn and adapt to new tools and technologies
- Clear, proactive communicator who surfaces issues early
How to apply for this opportunity
- Step 1: Click on Apply, then register or log in on our portal.
- Step 2: Complete the Screening Form and upload your updated resume.
- Step 3: Increase your chances of getting shortlisted and meet the client for the interview!
About Uplers:
Our goal is to make hiring reliable, simple, and fast. Our role is to help all our talents find and apply for relevant contractual onsite opportunities and progress in their careers. We will support you with any grievances or challenges you may face during the engagement.
(Note: There are many more opportunities apart from this on the portal. Depending on the assessments you clear, you can apply for them as well).
So, if you are ready for a new challenge, a great work environment, and an opportunity to take your career to the next level, don't hesitate to apply today. We are waiting for you!