Drive the design, automation, and reliability of Albert Invent's core platform to support scalable, high-performance AI applications.
You will partner closely with Product Engineering and SRE teams to ensure security, resiliency, and developer productivity while owning end-to-end service operability.
Key Responsibilities
- Own the design, reliability, and operability of Albert's mission-critical platform.
- Work closely with Product Engineering and SRE to build scalable, secure, and high-performance services.
- Plan and deliver core platform capabilities that improve developer velocity, system resilience, and scalability.
- Maintain a deep understanding of microservices topology, dependencies, and behavior.
- Act as the technical authority for performance, reliability, and availability across services.
- Drive automation and orchestration across infrastructure and operations.
- Serve as the final escalation point for complex or undocumented production issues.
- Lead root-cause analysis, mitigation strategies, and long-term system improvements.
- Mentor engineers in building robust, automated, and production-grade systems.
- Champion best practices in SRE, reliability, and platform engineering.
Must-Have Requirements
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
- 4+ years of strong backend coding in Python or Node.js.
- 4+ years of overall software engineering experience, including 2+ years in an SRE / automation-focused role.
- Strong hands-on experience with Infrastructure as Code (Terraform preferred).
- Deep experience with AWS cloud infrastructure and distributed systems (microservices, APIs, service-to-service communication).
- Experience with observability systems: logs, metrics, and tracing.
- Experience using CI/CD pipelines (e.g., CircleCI).
- Performance testing experience using K6 or similar tools.
- Strong focus on automation, standards, and operational excellence.
- Experience building low-latency APIs (
- Ability to work in fast-paced, high-ownership environments.
- Proven ability to lead technically, mentor engineers, and influence engineering quality.
Good-to-Have Skills
- Kubernetes and container orchestration experience.
- Observability tools such as Prometheus, Grafana, OpenTelemetry, Datadog.
- Experience building Internal Developer Platforms (IDPs) or reusable engineering frameworks.
- Exposure to ML infrastructure or data engineering pipelines.
- Experience working in compliance-driven environments (SOC2, HIPAA, etc.).
Skills: Automation, Terraform, Python, Node.js, and Amazon Web Services (AWS)