Lead DevOps Engineer (Platform Reliability + CI/CD)

Fegmo

Noida, India

7-12 Years

Posted 3 days ago

Job Description

About Fegmo

Fegmo is building an agentic commerce platform that keeps product and marketing content continuously channel-ready across every endpoint: marketplaces, retailer sites, and emerging AI agents. We help retailers, brands, and marketplaces automate ingestion, enrichment, validation, and syndication so teams launch faster, reduce manual work, and improve conversion.

Our platform runs on modern cloud and AI infrastructure (LLMs, retrieval, microservices) and powers workflows such as onboarding, taxonomy and attributes, media generation (images and video), localization, and compliance at enterprise scale.

We are expanding our engineering leadership team in India and building a strong, co-located execution culture in Noida.

Role Overview

We are seeking a hands-on Lead DevOps Engineer who will own the reliability and operational excellence of the Fegmo platform. This is a high-ownership role focused on making releases predictable, environments stable, and production observable, secure, and cost-efficient.

You will be based on-site in Noida and work closely with engineering leadership, backend/frontend teams, and AI engineers.

In the next 90 days, your primary mission is to harden CI/CD, deployments, observability, and environment hygiene so the platform can ship consistently with lower regression and incident risk.

Key Responsibilities

CI/CD and Release Engineering

Own and improve GitHub Actions pipelines, build and deploy automation, and release readiness practices.
Introduce practical release gates (automated checks, smoke tests, deployment verification) to reduce rollback and hotfix frequency.
Reduce build times and improve deployment safety (progressive rollouts, safe config changes, repeatable runbooks).

Cloud Infrastructure and Environments (GCP)

Own deployments and runtime reliability on Google Cloud (Cloud Run, GCE, GCS), including environment consistency across dev, staging, and production.
Improve IAM hygiene, secrets handling, and least-privilege access patterns.
Optimize Docker images and runtime performance (cold starts, resource sizing, scaling behavior).

Observability, Incident Response, and Reliability

Implement and standardize logging, metrics, alerting, and dashboards for core services and critical workflows.
Establish on-call and incident practices (triage, severity, communication, RCA, follow-ups) that improve stability over time.
Define reliability targets and operational checks for critical jobs, integrations, and data pipelines.

Cost and Performance Optimization

Monitor and optimize cloud spend for services, storage, and AI workloads, and implement cost guardrails.
Resolve performance bottlenecks across deployments and background processing, and help teams choose pragmatic scaling strategies.

Platform Security Basics

Enforce secure defaults across environments: secrets rotation, auditability, access reviews, dependency hygiene, and secure configuration practices.
Partner with engineering teams to ensure secure-by-default deployment patterns and environment hygiene.

Support for AI and Data Workloads (light MLOps)

Support AI services with reliable deployments, monitoring, and cost/latency controls (model calls, RAG services, batch jobs).
Enable safe experimentation without destabilizing production environments.

Required Skills and Experience

7-12 years in DevOps, SRE, or platform engineering roles, including ownership of production systems.
Strong experience with CI/CD (GitHub Actions or similar), Docker, and Linux.
Strong experience on GCP (Cloud Run, GCE, storage, IAM) and operating cloud-native services.
Proven track record improving release reliability, observability, and incident response practices.
Strong fundamentals in networking, security hygiene, secrets management, and access controls.
Comfort collaborating with engineers to set standards and drive adoption (not just operating pipelines).

Nice-to-Have

Infrastructure as code (Terraform), Kubernetes, GitOps practices.
Experience supporting AI/ML workloads (MLOps, evaluation pipelines, batch processing, cost controls).
Experience operating integration-heavy SaaS platforms (connectors, retries, idempotency, error taxonomy, monitoring).
Experience with database operations and performance tuning (MySQL/PostgreSQL) and queue/job systems.

Why Join Fegmo

Own platform reliability and delivery velocity at an early-stage AI-native company.
Build the operational foundation that enables fast product iteration without sacrificing quality.
High ownership and impact, with the opportunity to take on expanded scope as the team grows.

Culture Note

We are building a thoughtful, mission-driven team that values collaboration, creativity, and inclusion. We welcome applicants from all backgrounds and are especially excited to hear from candidates with unique perspectives and a passion for building meaningful tools.