Production Engineer - Applied Machine Learning Engine (Singapore)

ByteDance

Singapore

Fresher

Posted 5 hours ago

Job Description

About The Team

Backed by ByteDance's world-leading core algorithm businesses in recommendation, advertising, and search, the Data-AML team is dedicated to building high-performance, highly available machine learning storage systems that support trillion-parameter models. We tackle the extreme challenges of globalized, ultra-large-scale clusters while playing a key role in the development and evolution of machine learning infrastructure. In this team, you'll have the opportunity to sharpen your expertise in multiple subdirections, including model serving, model training, and scheduling and orchestration. You will work in a team that runs machine learning services central to ByteDance at the highest level of availability and that builds highly automated systems and pipelines.

Responsibilities

- Responsible for production operations management and stability assurance of AML training, inference, and storage systems, covering core pipelines such as scheduling and orchestration, Kubernetes (K8s)/GPU clusters, distributed training, online inference serving, and Parameter Server/NoSQL storage.
- Build and maintain SLO/SLA frameworks, observability, alerting, on-call processes, incident diagnosis, self-healing mechanisms, disaster recovery, and post-incident review (postmortem) practices.
- Drive engineering capabilities including CI/CD, canary/gradual deployments, automated rollback, system health inspections, pre-flight checks, capacity forecasting, and elastic scaling.
- Lead resource governance and optimization across GPU, CPU, storage, and network infrastructure, including quota management, cost attribution, and performance tuning, to improve system availability, resource utilization, and engineering productivity.

Qualifications

Minimum Qualification(s)

- Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, or a related field.
- Proficient in Linux and skilled in at least one of the following programming languages: Shell, Python, Go, or C++.
- Familiar with machine learning training and inference architectures, Kubernetes, GPU clusters, or distributed storage systems.
- Experienced in production issue troubleshooting, performance analysis, and automation platform development.
- Strong sense of ownership, solid analytical and problem-solving skills, and the ability to drive cross-functional collaboration to resolve complex technical challenges.

Preferred Qualification(s)

- Prior experience with large-scale training, inference, or storage platforms, SLO governance, FinOps practices, NoSQL systems, or open-source infrastructure projects.
- Familiar with the Kubernetes (K8s) ecosystem, with hands-on experience in operating and governing large-scale containerized clusters, including areas such as Operators, declarative operations, and release protection mechanisms.
- Familiar with recommendation and advertising system architectures, with experience in AI infrastructure components such as Parameter Servers or KV Caches (e.g., Mooncake).

More Info

Job Type:

Permanent Job

Industry:

IT/Computers - Software

Function:

Machine Learning Infrastructure

Employment Type:

Full time

About Company

ByteDance

Job Source: jobs.bytedance.com

ByteDance is a technology company operating a range of content platforms that inform, educate, entertain and inspire people across languages, cultures, and geographies.
Dedicated to building global platforms of creation and interaction, ByteDance now has a portfolio of applications available in over 150 markets and 75 languages, including TikTok, Helo, Vigo Video, Douyin, and Huoshan.

Job ID: 149299441

Last Updated: 19-06-2026 02:34:10 AM
