Responsibilities
About The Team
Backed by ByteDance's world-leading core algorithm businesses in recommendation, advertising, and search, the Data-AML team is dedicated to building high-performance, highly available machine learning storage systems that support trillion-parameter models. We tackle the extreme challenges of globalized, ultra-large-scale clusters, while playing a key role in the development and evolution of machine learning infrastructure. In this team, you'll have the opportunity to sharpen your expertise in multiple areas, including model serving, model training, and scheduling and orchestration. You will work on a team that serves ByteDance's most central machine learning services at the highest level of availability, while building highly automated systems and pipelines.

Responsibilities
- Manage production operations and assure the stability of AML training, inference, and storage systems, covering core pipelines such as scheduling and orchestration, Kubernetes (K8s)/GPU clusters, distributed training, online inference serving, and Parameter Server/NoSQL storage.
- Build and maintain SLO/SLA frameworks, observability, alerting, on-call processes, incident diagnosis, self-healing mechanisms, disaster recovery, and post-incident review (postmortem) practices.
- Drive engineering capabilities including CI/CD, canary/gradual deployments, automated rollback, system health inspections, pre-flight checks, capacity forecasting, and elastic scaling.
- Lead resource governance and optimization across GPU, CPU, storage, and network infrastructure, including quota management, cost attribution, and performance tuning, to improve system availability, resource utilization, and engineering productivity.
Qualifications
Minimum Qualification(s)
- Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, or a related field.
- Proficient in Linux and skilled in at least one of the following programming languages: Shell, Python, Go, or C++.
- Familiar with machine learning training and inference architectures, Kubernetes, GPU clusters, or distributed storage systems.
- Experienced in production issue troubleshooting, performance analysis, and automation platform development.
- Strong sense of ownership, solid analytical and problem-solving skills, and the ability to drive cross-functional collaboration to resolve complex technical challenges.

Preferred Qualification(s)
- Prior experience with large-scale training, inference, or storage platforms, SLO governance, FinOps practices, NoSQL systems, or open-source infrastructure projects.
- Familiar with the Kubernetes (K8s) ecosystem, with hands-on experience in operating and governing large-scale containerized clusters, including areas such as Operators, declarative operations, and release protection mechanisms.
- Familiar with recommendation and advertising system architectures, with experience in AI infrastructure components such as Parameter Servers or KV Caches (e.g., Mooncake).