DevOps Automation Engineer

Uplers

Bhubaneswar, India

3-5 Years

Posted a day ago

Job Description

Experience: 3+ years

Salary: Confidential (based on experience)

Expected Notice Period: 7 Days

Shift: (GMT+05:30) Asia/Kolkata (IST)

Opportunity Type: Remote

Placement Type: Full-Time Indefinite Contract (40 hrs a week/160 hrs a month)

(Note: This is a requirement for one of Uplers' clients: Strategic Transformation Through Digital & Physical Innovation)

What do you need for this opportunity

Must have skills required:

Grafana, Kubernetes tools, Monitoring tools, Prometheus, Scripting, CI/CD, CockroachDB, Terraform, AWS, Docker, GCP, GitHub, Kubernetes

Strategic Transformation Through Digital & Physical Innovation is looking for:

DevOps Automation Engineer

Dual-Cloud Infrastructure - AWS, Google Cloud, CockroachDB & Tailscale

About The Role

This role spans:

Dual-cloud infrastructure (AWS + GCP)
Developer workstation management
Security automation
Incident response
CI/CD pipeline operations

The platform is scaling to serve millions of consumers across 20,000+ veterinary clinics in 100+ countries, with tens of millions of projected concurrent sessions.

You must have real-world experience operating infrastructure at this scale, including:

Massive server loads
Database replication and failover at scale
Disaster recovery
Performance optimization under heavy concurrent traffic

The current production stack runs on ECS Fargate with RDS PostgreSQL, with CockroachDB planned for distributed workloads.

You will manage:

Tailscale VPN infrastructure
20+ EC2 dev boxes
Cloudflare for 23 zones
PagerDuty alerting
A robust security automation layer

This is a hands-on role with real operational ownership.

What You'll Do

Cloud Infrastructure (AWS + GCP)

Design, build, and manage dual-cloud infrastructure across AWS and Google Cloud Platform
Manage ECS Fargate deployments — task definitions, service discovery, ALB target groups, and blue/green deployments
Automate infrastructure provisioning using Terraform with modular, reusable configurations
Build and maintain CI/CD pipelines using GitLab CI and GitHub Actions
Manage containerized applications using Docker, ECS, and Kubernetes (EKS/GKE for planned workloads)
Support multi-tenant and multi-region application architectures across 6+ global regions
Implement and maintain CockroachDB clusters for distributed, geo-partitioned data (planned migration from RDS PostgreSQL)
Implement infrastructure cost optimization through:
Auto-scaling
Reserved capacity
Right-sizing
Spot instances
Savings Plans
Continuously monitor and reduce cloud spend across AWS and GCP
Optimize database costs through:
Right-sizing instances
Storage tiering
Reserved capacity
Query performance tuning

Developer Workstation Infrastructure

Provision and manage 20+ EC2 dev boxes across 3 AWS regions
Build custom AMIs using Packer for dev boxes and DERP relays
Deploy and maintain:
Memory watchdog
noVNC
CloudWatch agent configurations
Run fleet management commands across dev boxes via AWS Systems Manager (SSM)
Monitor dev box health and performance

Tailscale VPN Administration

Manage Tailscale ACL policies and user access
Operate custom DERP relays in 3 regions
Configure app connectors for SaaS IP lockdown
Maintain Mullvad VPN integration for egress control

Security Automation

Own GuardDuty, Security Hub, and AWS Config across all regions
Manage EventBridge rules for security alert routing
Build and manage:
IAM policies
Secrets management
WAF
Zero-trust networking
Administer GitHub Enterprise security, including:
Org management
IP allowlists
Secret scanning policies
Runner management

Scale, Performance & Disaster Recovery

Design and operate infrastructure capable of handling millions of concurrent users and tens of millions of sessions across global regions
Implement and manage auto-scaling policies, including:
ECS service auto-scaling
EC2 ASGs
RDS read replicas
Conduct load testing and capacity planning
Design and maintain database scaling strategies:
Read replicas
Connection pooling
Query optimization
Sharding for high-throughput workloads
Own disaster recovery (DR) planning and execution:
Multi-region failover
RTO/RPO targets
Automated recovery runbooks
Regular DR drills
Implement and manage database backup strategies:
Point-in-time recovery
Cross-region replication
Automated restore testing
Optimize CDN and edge caching (Cloudflare) for global traffic at scale
Monitor and resolve performance bottlenecks across:
Application servers
Databases
Caches
Network layers
Build runbooks for incident response during:
High-traffic events
Database failovers
Regional outages

Monitoring, Alerting & Incident Response

Configure and maintain PagerDuty
Monitor system performance using:
Prometheus
Grafana
CloudWatch
Cloud Monitoring
Manage EBS backup automation, including:
Daily backups
30-day retention
Cross-region copy
Vault lock

CI/CD & Repository Operations

Manage GitLab mirroring from GitHub
Maintain 45+ cron jobs on the admin box
Manage Cloudflare across 23 zones, including:
CDN
DNS
WAF configuration
Collaborate with developers to improve deployment workflows and reduce lead time

AI/ML Infrastructure & Tooling

Use Claude Code / Cursor for:
Terraform authoring
Script generation
Infrastructure debugging
Support AI/ML infrastructure, including:
GPU instance management
Model deployment pipelines
Maintain and improve AI-assisted monitoring and alerting
Support infrastructure requirements for AI-enabled platform capabilities

Must-Have Skills

3–5+ years of experience in DevOps / Cloud / Infrastructure Automation at scale
High-Scale Production Experience (Critical) — must have operated infrastructure serving millions of users with high concurrency
Experience with:
Server load management
Database scaling (read replicas, connection pooling, sharding)
Auto-scaling policies
Performance optimization under heavy traffic
Strong hands-on experience with AWS, including:
ECS Fargate
EKS
Lambda
S3
RDS
CloudFront
SQS
IAM
SSM
GuardDuty
Security Hub
Config
Working experience with Google Cloud Platform, including:
GKE
Cloud Run
BigQuery
Cloud Functions
IAM
ECS Fargate production experience
Terraform (Infrastructure as Code) with multi-environment, modular patterns
Tailscale VPN administration — ACLs, DERP relays, app connectors
Packer for AMI builds
Docker and container orchestration in production
Experience with GitLab CI/CD and/or GitHub Actions
GitHub Enterprise administration
Cloudflare CDN/DNS/WAF management
PagerDuty or equivalent incident response configuration
Production experience with CockroachDB or distributed SQL databases (or strong willingness to learn)
Disaster recovery planning and execution, including:
Multi-region failover
Backup automation
RTO/RPO targets
Recovery runbooks
Database performance optimization at scale, including:
Replication
Connection pooling
Query tuning
Capacity planning
Cost Optimization (Critical) — proven track record of reducing cloud infrastructure costs through:
Right-sizing
Reserved capacity
Spot instances
Storage tiering
Waste reduction
Good understanding of Linux systems, networking, and security fundamentals
Strong communication skills and ability to work in a remote, globally distributed team

Nice-to-Have Skills

Experience with Kubernetes tools:
Helm
ArgoCD
Flux
Experience with monitoring stacks:
Prometheus
Grafana
ELK
Loki
AWS Systems Manager fleet management at scale
Experience working in startup or fast-paced product environments
Scripting experience:
Bash
Python
Go
Experience supporting AI/ML workloads and GPU infrastructure
Experience with chaos engineering tools (Gremlin, Litmus) for resilience testing
FinOps certification or formal cloud cost management framework experience

What We're Looking For (Mindset)

Strong ownership and problem-solving mindset
Comfort working in a fast-growing, evolving environment
Ability to balance speed with stability and security
Willingness to learn and adapt to new tools and technologies
Clear, proactive communicator who surfaces issues early

How to apply for this opportunity

Step 1: Click on Apply, then register or log in on our portal.
Step 2: Complete the screening form and upload your updated resume.
Step 3: Increase your chances of getting shortlisted and meeting the client for the interview!

About Uplers:

Our goal is to make hiring reliable, simple, and fast. Our role is to help all our talents find and apply for relevant contractual onsite opportunities and progress in their careers. We will support you with any grievances or challenges you may face during the engagement.

(Note: There are many more opportunities apart from this on the portal. Depending on the assessments you clear, you can apply for them as well).

So, if you are ready for a new challenge, a great work environment, and an opportunity to take your career to the next level, don't hesitate to apply today. We are waiting for you!