Senior AI Cluster Tools Developer

Company: NVIDIA
Location: Santa Clara, CA, USA
Salary: $148,000 – $287,500
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience
  • Strong software design and implementation ability with Python/Go/C++
  • Good understanding of deep learning and AI frameworks such as PyTorch and TensorFlow
  • Knowledge of AI cluster job scheduling, storage management, and network management
  • Knowledge of the Linux kernel
  • Excellent problem-solving and project management skills
  • Flexibility to work in an evolving environment with changing requirements

Responsibilities

  • Build internal performance/power profiling and analysis tools and platforms for AI workloads at cluster scale
  • Build debugging tools for commonly encountered problems in GPU clusters
  • Work with our users to build and calibrate performance/power models for next-generation hardware or systems
  • Partner with architects to propose new hardware features or improve existing ones based on real-world use cases

Preferred Qualifications

  • Proven experience with continuous profiling and analysis tools/platforms at GPU cluster scale
  • Solid experience in troubleshooting large AI jobs and in failure detection/recovery
  • Skilled in deep learning application performance analysis and optimization
  • Knowledgeable in GPU/CPU architecture and in application performance or power efficiency analysis