Responsibilities:
1. Design, deploy, maintain, and optimize GPU server clusters to support image-algorithm training and other AI workloads.
2. Manage the allocation and utilization of hardware resources (GPU, CPU, memory, and storage) to meet the R&D requirements of multiple departments.
3. Optimize system performance and job-scheduling strategies (e.g., with SLURM, Kubernetes, or Docker) to improve resource utilization.
4. Establish system monitoring, data security, automated backup, and disaster recovery mechanisms to ensure service stability.
5. Collaborate closely with algorithm teams to understand model-training needs and provide tailored infrastructure solutions.
6. Prepare system documentation and operational guidelines, and provide technical training and support to users.
7. Troubleshoot hardware and software issues involving servers, networks, and storage, and liaise with external vendors when necessary.
Qualifications:
1. Educational Background: Bachelor’s degree or higher in Computer Science, Information Systems, or a related field.
2. Work Experience: At least 3 years of experience in GPU server or AI infrastructure management; candidates with AI training platform support experience are preferred.
3. Technical Skills: Familiarity with Linux system administration, server network configuration, and automated infrastructure deployment.
4. Proficiency with GPU cluster deployment and management tools (e.g., NVIDIA Docker, CUDA, SLURM, Kubernetes).
5. Knowledge of storage systems (e.g., RAID, NFS) and system monitoring tools (e.g., Prometheus, Grafana).
6. Understanding of mainstream deep learning frameworks (e.g., PyTorch, TensorFlow) and their training-environment dependencies.
7. Strong communication, coordination, and documentation skills, with the ability to collaborate effectively with algorithm, hardware, and software teams.