Senior AI-HPC Storage Engineer
Company | NVIDIA |
---|---|
Location | Austin, TX, USA; Santa Clara, CA, USA; Westford, MA, USA |
Salary | $184,000 – $356,500 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior, Expert or higher |
Requirements
- Bachelor’s degree in Computer Science, Electrical Engineering, or a related field, or equivalent experience.
- 8+ years of experience designing and operating large-scale storage infrastructure.
- Experience analyzing and tuning performance for a variety of AI/HPC workloads.
- Experience with one or more parallel or distributed filesystems, such as Lustre or GPFS, is a must.
- Proficiency in CentOS/RHEL and/or Ubuntu Linux distributions, along with Python programming and Bash scripting.
- Experience with the architecture, design, and operation of storage solutions on any of the leading cloud environments (AWS, Azure, or GCP).
- Experience with AI/HPC cluster job schedulers such as Slurm or LSF.
- In-depth understanding of container technologies such as Docker and Enroot.
- Experience with AI/HPC workflows that use MPI.
Responsibilities
- Research and implement distributed storage services.
- Design and implement on-prem AI/HPC infrastructure, supplemented with cloud computing, to support the growing needs of NVIDIA.
- Design and implement scalable and efficient next-gen storage solutions tailored for data-intensive applications, optimizing performance and cost-effectiveness.
- Develop tooling to automate management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources.
- Document general procedures and practices, and perform technology evaluations, related to distributed file systems.
- Collaborate across teams to better understand developers’ workflows and gather their infrastructure requirements.
- Influence and guide methodologies for building, testing, and deploying applications to ensure optimal performance and resource utilization.
- Support our researchers in running their flows on our clusters, including performance analysis and optimization of deep learning workflows.
- Perform root cause analysis and suggest corrective actions for problems at both large and small scales.
Preferred Qualifications
- Experience with NVIDIA GPUs, CUDA programming, NCCL, and MLPerf benchmarking
- Experience with Machine Learning and Deep Learning concepts, algorithms and models
- Familiarity with InfiniBand, including IPoIB and RDMA
- Background with Software-Defined Networking and AI/HPC cluster networking
- Familiarity with deep learning frameworks like PyTorch and TensorFlow