Senior AI-HPC Cluster Engineer

Company: NVIDIA
Location: Austin, TX, USA; Santa Clara, CA, USA; Durham, NC, USA; Westford, MA, USA
Salary: $148,000 – $356,500
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering or related field or equivalent experience
  • Minimum 5 years of experience designing and operating large scale compute infrastructure
  • Experience with AI/HPC advanced job schedulers, such as Slurm, K8s, RTDA or LSF
  • Proficient in administering CentOS/RHEL and/or Ubuntu Linux distributions
  • Solid understanding of cluster configuration management tools such as Ansible, Puppet, Salt
  • In-depth understanding of container technologies like Docker, Singularity, Podman, Shifter, Charliecloud
  • Proficiency in Python programming and bash scripting
  • Excellent problem-solving skills, with the ability to analyze complex systems, identify bottlenecks, and implement scalable solutions.
  • Applied experience with AI/HPC workflows that use MPI
  • Excellent communication and teamwork skills, with the ability to work effectively with diverse teams and individuals.
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads.
  • Passion for continual learning and staying ahead of emerging technologies and effective approaches in the HPC and AI/ML infrastructure fields.

Responsibilities

  • Provide leadership and strategic guidance on the management of large-scale HPC systems including the deployment of compute, networking, and storage.
  • Develop and improve our ecosystem around GPU-accelerated computing including developing scalable automation solutions
  • Build and maintain AI and ML heterogeneous clusters on-premises and in the cloud
  • Create and cultivate customer and cross-team relationships to reliably sustain the clusters and meet evolving user needs
  • Support our researchers in running their workloads, including performance analysis and optimizations
  • Conduct root cause analysis and suggest corrective action
  • Proactively find and fix issues before they occur

Preferred Qualifications

  • Experience with NVIDIA GPUs, CUDA Programming, NCCL and MLPerf benchmarking
  • Experience with Machine Learning and Deep Learning concepts, algorithms and models
  • Familiarity with InfiniBand with IPoIB and RDMA
  • Understanding of fast, distributed storage systems like Lustre and GPFS for AI/HPC workloads
  • Familiarity with deep learning frameworks like PyTorch and TensorFlow