Techolution

Site Reliability Engineer

Fresher
  • Posted 2 days ago

Job Description

Ready to architect the reliability backbone of cloud-native platforms that serve hundreds of millions of users? Join us in making the leap from Lab Grade AI to Real World AI, applying your skills in CI/CD, monitoring, and microservices to build the enterprise of tomorrow.

Techolution is searching for a dynamic Senior Site Reliability Engineer (SRE) with deep, hands-on experience in real-world enterprise environments. If you possess production expertise in AWS, EKS, and Terraform, focusing on engineering reliable, observable and scalable cloud systems, and have a proven track record of leading incident response and driving platform-wide improvements, we want you.

Designation: Senior Site Reliability Engineer (SRE)

Location: Remote

Employment Type: Full Time

Shift Timings: 6 PM IST to 2:30 AM IST

Please note: we are only considering candidates who are currently serving their notice period or can join immediately. If you are not serving notice, or have a 60-day or 90-day notice period, please refrain from applying.

Key Responsibilities:

  • Own production reliability and incident response for cloud-native services on AWS, including SLO/SLI definition, error-budget management, and end-to-end leadership of Sev-1 and Sev-2 events.
  • Architect, deploy, and operate containerized workloads on EKS (Kubernetes), ensuring scalable, secure, and zero-downtime applications across multiple environments.
  • Design and manage infrastructure programmatically using Terraform, driving consistency, drift detection, and policy-as-code across multi-account AWS landing zones.
  • Engineer and maintain CI/CD pipelines using tools such as Jenkins, GitHub Actions, or GitLab CI, streamlining software release cycles and improving deployment frequency and safety.
  • Build observability across the stack using monitoring and logging tools like New Relic, Prometheus, Grafana, and the ELK Stack, designing alerts that fire on user-impacting symptoms rather than noise.
  • Troubleshoot complex production issues across the microservices architecture, performing deep root-cause analysis and driving lasting fixes through post-incident reviews.

Technical Skills:

  • Deep production expertise in AWS: Hands-on experience across EC2, VPC, IAM, S3, RDS, CloudFront, Route53, and KMS at multi-account scale. AWS is the foundation of our client engagement, and your fluency here directly determines Day-1 impact.
  • Strong experience with EKS and Kubernetes: Production operation of clusters including upgrades, autoscaling, networking, secrets management, and resolving noisy-neighbor and resource-starvation scenarios. This is the orchestration layer for the entire platform.
  • Mastery of Terraform: Module design, remote state management, workspaces, drift detection, and CI-integrated plan and apply workflows. You will be the technical custodian of how infrastructure is shipped.
  • Hands-on engineering of CI/CD pipelines (Jenkins, GitHub Actions, or GitLab CI): Including build, test, security-gate, signing, and progressive-delivery stages. Release velocity depends on the pipelines you own.
  • Elite Troubleshooting Skills: Calm, methodical, hypothesis-driven debugging across the network, OS, runtime, and application layers in real time. This is the single highest-leverage skill in the role.
  • Working knowledge of Monitoring and Logging Tools (New Relic, Prometheus, ELK Stack): Production exposure to designing dashboards, alerts, and distributed traces that surface real customer impact.
  • Strong grasp of Microservices Architecture: Fluency with distributed-system patterns including service discovery, retries, idempotency, circuit breakers, and async messaging.
  • Exposure to AWS CDK and Lambda: Comfort building infrastructure and event-driven systems programmatically to reduce toil and extend the platform.
  • Preferred development skills in Java and/or JavaScript/TypeScript: Enough to read service code, ship small fixes, and pair productively with application engineers.
  • Active certification: at least one of AWS Solutions Architect (Associate or Professional), AWS Developer Associate, AWS DevOps Engineer Professional, or a Kubernetes certification (CKA or CKAD). Required by the client engagement and a signal of continued investment in craft.

Foundational Must Haves:

  • Exceptional Collaboration and Communication Skills: You will work directly with senior client stakeholders, write incident reports that read like product docs, and represent Techolution's engineering bar in client forums.
  • Demonstrated Ownership: Taking full responsibility for production systems from inception to incident closure, and proactively seeking improvements rather than waiting for tickets. This mindset is critical for driving reliability forward.
  • Possession of a Seeker Mindset: A relentless curiosity about how systems fail and an obsession with making them fail less, paired with eagerness to learn new technologies in the rapidly evolving cloud landscape.
  • Genuine Passion Towards Work: A deep enthusiasm for engineering craft and problem-solving, translating into high-quality contributions and a positive impact on our team and clients.
  • Extremely Ambitious Drive: A strong desire to excel, push boundaries, and contribute significantly to Techolution's innovative goals and client success, including the resilience to operate on a US-aligned shift.
  • Unbeatable Work Ethic: A commitment to diligence, reliability, and integrity in all aspects of your work, ensuring consistent high performance and trust within the team and with the client.
  • Exceptional Ability to Comprehend: The capacity to quickly understand complex technical architectures, project requirements, and team discussions, enabling effective problem-solving and collaboration.

Negotiable Skills:

  • Exposure to advanced Kubernetes ecosystem tools (Helm, KEDA, Karpenter, service mesh): Experience operating these in production to handle complex autoscaling and traffic management scenarios.
  • Knowledge of advanced observability practices (distributed tracing, RED/USE metrics, SLO engineering): Designing telemetry that reflects customer experience rather than just infrastructure health.
  • Familiarity with Chaos Engineering tools (Gremlin, AWS FIS, Chaos Monkey): Practical experience injecting failure to validate system resilience before incidents do.
  • Basic understanding of Database Administration (RDS, Aurora, DynamoDB): Knowledge of fundamental database concepts and operations, useful for managing data persistence layers in production applications.
  • Exposure to AI/ML workload reliability (model serving, GPU node groups, inference autoscaling): A strong plus given Techolution's focus on real-world AI in production environments.

Job ID: 148484547
