Senior AI-HPC Storage Engineer

Company: NVIDIA
Location: Austin, TX, USA; Santa Clara, CA, USA; Westford, MA, USA
Salary: $184,000 – $356,500
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
  • 8+ years of experience designing and operating large-scale storage infrastructure.
  • Experience analyzing and tuning performance for a variety of AI/HPC workloads.
  • Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.
  • Proficiency with CentOS/RHEL and/or Ubuntu Linux distributions, including Python programming and Bash scripting.
  • Experience with the architecture, design, and operation of storage solutions on any of the leading cloud environments (AWS, Azure, or GCP).
  • Experience with AI/HPC cluster job schedulers such as Slurm or LSF.
  • In-depth understanding of container technologies such as Docker and Enroot.
  • Experience with AI/HPC workflows that use MPI.

Responsibilities

  • Research and implement distributed storage services.
  • Design and implement on-prem AI/HPC infrastructure, supplemented with cloud computing, to support the growing needs of NVIDIA.
  • Design and implement scalable and efficient next-gen storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.
  • Develop tooling to automate management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
  • Document general procedures and practices, and perform technology evaluations, related to distributed file systems.
  • Collaborate across teams to better understand developers’ workflows and gather their infrastructure requirements.
  • Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.
  • Support our researchers in running their workflows on our clusters, including performance analysis and optimization of deep learning workflows.
  • Perform root cause analysis and suggest corrective actions for problems at both large and small scales.

Preferred Qualifications

  • Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking.
  • Experience with machine learning and deep learning concepts, algorithms, and models.
  • Familiarity with InfiniBand, including IPoIB and RDMA.
  • Background in software-defined networking and AI/HPC cluster networking.
  • Familiarity with deep learning frameworks such as PyTorch and TensorFlow.