Uplers

DevOps Automation Engineer

  • Posted a day ago

Job Description

Experience: 3+ years

Salary: Confidential (based on experience)

Expected Notice Period: 7 Days

Shift: (GMT+05:30) Asia/Kolkata (IST)

Opportunity Type: Remote

Placement Type: Full-Time Indefinite Contract (40 hrs a week/160 hrs a month)

(*Note: This is a requirement for one of Uplers' clients - Strategic Transformation Through Digital & Physical Innovation)

What do you need for this opportunity

Must have skills required:

Grafana, Kubernetes tools, Monitoring tools, Prometheus, Scripting, CI/CD, CockroachDB, Terraform, AWS, Docker, GCP, GitHub, Kubernetes

Strategic Transformation Through Digital & Physical Innovation is looking for:

DevOps Automation Engineer

Dual-Cloud Infrastructure - AWS, Google Cloud, CockroachDB & Tailscale

About The Role

This role spans:

  • Dual-cloud infrastructure (AWS + GCP)
  • Developer workstation management
  • Security automation
  • Incident response
  • CI/CD pipeline operations

The platform is scaling to serve millions of consumers across 20,000+ veterinary clinics in 100+ countries, with tens of millions of projected concurrent sessions.

You must have real-world experience operating infrastructure at this scale, including:

  • Massive server loads
  • Database replication and failover at scale
  • Disaster recovery
  • Performance optimization under heavy concurrent traffic

The current production stack runs on ECS Fargate with RDS PostgreSQL, with CockroachDB planned for distributed workloads.

You will manage:

  • Tailscale VPN infrastructure
  • 20+ EC2 dev boxes
  • Cloudflare for 23 zones
  • PagerDuty alerting
  • A robust security automation layer

This is a hands-on role with real operational ownership.

What You'll Do

Cloud Infrastructure (AWS + GCP)

  • Design, build, and manage dual-cloud infrastructure across AWS and Google Cloud Platform
  • Manage ECS Fargate deployments — task definitions, service discovery, ALB target groups, and blue/green deployments
  • Automate infrastructure provisioning using Terraform with modular, reusable configurations
  • Build and maintain CI/CD pipelines using GitLab CI and GitHub Actions
  • Manage containerized applications using Docker, ECS, and Kubernetes (EKS/GKE for planned workloads)
  • Support multi-tenant and multi-region application architectures across 6+ global regions
  • Implement and maintain CockroachDB clusters for distributed, geo-partitioned data (planned migration from RDS PostgreSQL)
  • Implement infrastructure cost optimization through:
      • Auto-scaling
      • Reserved capacity
      • Right-sizing
      • Spot instances
      • Savings Plans
  • Continuously monitor and reduce cloud spend across AWS and GCP
  • Optimize database costs through:
      • Right-sizing instances
      • Storage tiering
      • Reserved capacity
      • Query performance tuning

Developer Workstation Infrastructure


  • Provision and manage 20+ EC2 dev boxes across 3 AWS regions
  • Build custom AMIs using Packer for dev boxes and DERP relays
  • Deploy and maintain:
      • Memory watchdog
      • noVNC
      • CloudWatch agent configurations
  • Run fleet management commands across dev boxes via AWS Systems Manager (SSM)
  • Monitor dev box health and performance

Tailscale VPN Administration


  • Manage Tailscale ACL policies and user access
  • Operate custom DERP relays in 3 regions
  • Configure app connectors for SaaS IP lockdown
  • Maintain Mullvad VPN integration for egress control

Security Automation


  • Own GuardDuty, Security Hub, and AWS Config across all regions
  • Manage EventBridge rules for security alert routing
  • Build and manage:
      • IAM policies
      • Secrets management
      • WAF
      • Zero-trust networking
  • Administer GitHub Enterprise security, including:
      • Org management
      • IP allowlists
      • Secret scanning policies
      • Runner management

Scale, Performance & Disaster Recovery


  • Design and operate infrastructure capable of handling millions of concurrent users and tens of millions of sessions across global regions
  • Implement and manage auto-scaling policies, including:
      • ECS service auto-scaling
      • EC2 ASGs
      • RDS read replicas
  • Conduct load testing and capacity planning
  • Design and maintain database scaling strategies:
      • Read replicas
      • Connection pooling
      • Query optimization
      • Sharding for high-throughput workloads
  • Own disaster recovery (DR) planning and execution:
      • Multi-region failover
      • RTO/RPO targets
      • Automated recovery runbooks
      • Regular DR drills
  • Implement and manage database backup strategies:
      • Point-in-time recovery
      • Cross-region replication
      • Automated restore testing
  • Optimize CDN and edge caching (Cloudflare) for global traffic at scale
  • Monitor and resolve performance bottlenecks across:
      • Application servers
      • Databases
      • Caches
      • Network layers
  • Build runbooks for incident response during:
      • High-traffic events
      • Database failovers
      • Regional outages

Monitoring, Alerting & Incident Response


  • Configure and maintain PagerDuty
  • Monitor system performance using:
      • Prometheus
      • Grafana
      • CloudWatch
      • Cloud Monitoring
  • Manage EBS backup automation, including:
      • Daily backups
      • 30-day retention
      • Cross-region copy
      • Vault lock

CI/CD & Repository Operations


  • Manage GitLab mirroring from GitHub
  • Maintain 45+ cron jobs on the admin box
  • Manage Cloudflare across 23 zones, including:
      • CDN
      • DNS
      • WAF configuration
  • Collaborate with developers to improve deployment workflows and reduce lead time

AI/ML Infrastructure & Tooling


  • Use Claude Code / Cursor for:
      • Terraform authoring
      • Script generation
      • Infrastructure debugging
  • Support AI/ML infrastructure, including:
      • GPU instance management
      • Model deployment pipelines
  • Maintain and improve AI-assisted monitoring and alerting
  • Support infrastructure requirements for AI-enabled platform capabilities

Must-Have Skills


  • 3–5+ years of experience in DevOps / Cloud / Infrastructure Automation at scale
  • High-Scale Production Experience (Critical) — must have operated infrastructure serving millions of users with high concurrency
  • Experience with:
      • Server load management
      • Database scaling (read replicas, connection pooling, sharding)
      • Auto-scaling policies
      • Performance optimization under heavy traffic
  • Strong hands-on experience with AWS, including:
      • ECS Fargate
      • EKS
      • Lambda
      • S3
      • RDS
      • CloudFront
      • SQS
      • IAM
      • SSM
      • GuardDuty
      • Security Hub
      • Config
  • Working experience with Google Cloud Platform, including:
      • GKE
      • Cloud Run
      • BigQuery
      • Cloud Functions
      • IAM
  • ECS Fargate production experience
  • Terraform (Infrastructure as Code) with multi-environment, modular patterns
  • Tailscale VPN administration — ACLs, DERP relays, app connectors
  • Packer for AMI builds
  • Docker and container orchestration in production
  • Experience with GitLab CI/CD and/or GitHub Actions
  • GitHub Enterprise administration
  • Cloudflare CDN/DNS/WAF management
  • PagerDuty or equivalent incident response configuration
  • Production experience with CockroachDB or distributed SQL databases (or strong willingness to learn)
  • Disaster recovery planning and execution, including:
      • Multi-region failover
      • Backup automation
      • RTO/RPO targets
      • Recovery runbooks
  • Database performance optimization at scale, including:
      • Replication
      • Connection pooling
      • Query tuning
      • Capacity planning
  • Cost Optimization (Critical) - proven track record of reducing cloud infrastructure costs through:
      • Right-sizing
      • Reserved capacity
      • Spot instances
      • Storage tiering
      • Waste reduction
  • Good understanding of Linux systems, networking, and security fundamentals
  • Strong communication skills and ability to work in a remote, globally distributed team

Nice-to-Have Skills


  • Experience with Kubernetes tools:
      • Helm
      • ArgoCD
      • Flux
  • Experience with monitoring stacks:
      • Prometheus
      • Grafana
      • ELK
      • Loki
  • AWS Systems Manager fleet management at scale
  • Experience working in startup or fast-paced product environments
  • Scripting experience:
      • Bash
      • Python
      • Go
  • Experience supporting AI/ML workloads and GPU infrastructure
  • Experience with chaos engineering tools (Gremlin, Litmus) for resilience testing
  • FinOps certification or formal cloud cost management framework experience

What We're Looking For (Mindset)


  • Strong ownership and problem-solving mindset
  • Comfort working in a fast-growing, evolving environment
  • Ability to balance speed with stability and security
  • Willingness to learn and adapt to new tools and technologies
  • Clear, proactive communicator who surfaces issues early

How to apply for this opportunity


  • Step 1: Click on Apply, then register or log in on our portal.
  • Step 2: Complete the screening form and upload your updated resume.
  • Step 3: Increase your chances of getting shortlisted and meet the client for the interview!

About Uplers:


Our goal is to make hiring reliable, simple, and fast. Our role is to help all our talents find and apply for relevant contractual onsite opportunities and progress in their careers. We will support you with any grievances or challenges you may face during the engagement.

(Note: There are many more opportunities apart from this on the portal. Depending on the assessments you clear, you can apply for them as well).

So, if you are ready for a new challenge, a great work environment, and an opportunity to take your career to the next level, don't hesitate to apply today. We are waiting for you!





Job ID: 147691769