
Tesla

Staff Site Reliability Engineer (SRE), S3 storage

  • Posted 2 days ago

Job Description

Key Responsibilities:

  • System Reliability and Monitoring: Design and implement monitoring, alerting, and automation for S3 storage clusters to achieve 99.99%+ uptime. Use tools like Prometheus, Grafana, or Catchpoint to track performance metrics and capacity utilization and to detect anomalies.
  • Capacity Planning and Scaling: Forecast storage needs based on data growth trends (e.g., fleet expansion exceeding 80 PB) and proactively scale S3 buckets, lifecycle policies, and multi-region replication to support capacities of 150 PB and beyond.
  • Incident Management: Lead on-call rotations, troubleshoot storage-related incidents (e.g., data access latency, replication failures), and perform root cause analysis using methodologies like blameless post-mortems.
  • Automation and Infrastructure as Code: Develop and maintain automation scripts (e.g., using Terraform, Ansible, or Python) for provisioning, configuring, and managing S3 resources, including security policies, encryption, and access controls.
  • Performance Optimization: Optimize data ingestion, retrieval, and archival processes to handle high-throughput workloads, reducing costs through intelligent tiering (e.g., S3 Intelligent-Tiering) and data compression.
  • Security and Compliance: Ensure storage systems comply with data protection standards (e.g., GDPR, SOC 2), implementing features like bucket policies, versioning, and encryption at rest and in transit.
  • Collaboration and Innovation: Work with data engineering, AI, and energy teams to integrate S3 with other systems (e.g., Kubernetes, Spark). Contribute to open-source tools or internal projects for advanced storage solutions.
  • Documentation and Knowledge Sharing: Maintain runbooks, contribute to knowledge bases (e.g., in Confluence), and mentor junior engineers on best practices for object storage reliability.

Qualifications:

  • Experience: 5+ years in SRE, DevOps, or systems engineering roles, with at least 3 years focused on AWS S3 or similar object storage (e.g., GCS, Azure Blob). Proven track record managing large-scale (PB-level) storage systems.
  • Technical Skills:
      • Expertise in AWS services (S3, EC2, Lambda, CloudWatch) and infrastructure tools (Terraform, Kubernetes, Docker).
      • Proficiency in scripting/programming (Python, Go, Bash) for automation and tooling.
      • Strong understanding of distributed systems, networking, and storage concepts (e.g., eventual consistency, Cross-Region and Same-Region Replication (CRR/SRR)).
      • Experience with monitoring and logging tools (Prometheus, Grafana, Splunk).
  • Soft Skills: Excellent problem-solving abilities, strong communication skills, and a collaborative mindset. Ability to thrive in a fast-paced, high-stakes environment.
  • Education: Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).

    Job ID: 148621415
