Senior AI-HPC Cluster Engineer
Company | NVIDIA |
---|---|
Location | Austin, TX, USA; Santa Clara, CA, USA; Durham, NC, USA; Westford, MA, USA |
Salary | $148,000 – $356,500 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior |
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
- Minimum 5 years of experience designing and operating large-scale compute infrastructure
- Experience with AI/HPC advanced job schedulers, such as Slurm, K8s, RTDA or LSF
- Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
- Solid understanding of cluster configuration management tools such as Ansible, Puppet, Salt
- In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, Charliecloud
- Proficiency in Python programming and bash scripting
- Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
- Applied experience with AI/HPC workflows that use MPI
- Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
- Experience analyzing and tuning performance for a variety of AI/HPC workloads.
- Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.
Responsibilities
- Provide leadership and strategic guidance on the management of large-scale HPC systems, including the deployment of compute, networking, and storage.
- Develop and improve our ecosystem around GPU-accelerated computing, including developing scalable automation solutions
- Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
- Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs
- Support our researchers in running their workloads, including performance analysis and optimization
- Conduct root cause analysis and suggest corrective action. Proactively find and fix issues before they occur.
Preferred Qualifications
- Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Experience with Machine Learning and Deep Learning concepts, algorithms and models
- Familiarity with InfiniBand, including IBOP and RDMA
- Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
- Familiarity with deep learning frameworks like PyTorch and TensorFlow