Senior AI Cluster Tools Developer
| Company | NVIDIA |
|---|---|
| Location | Santa Clara, CA, USA |
| Salary | $148,000 – $287,500 |
| Type | Full-Time |
| Degrees | Bachelor’s |
| Experience Level | Senior |
Requirements
- BS or higher in Computer Science or a related field (or equivalent experience) and 5+ years of software development experience
- Strong software design and implementation ability with Python/Go/C++
- Good understanding of deep learning and AI frameworks such as PyTorch and TensorFlow
- Knowledge of AI cluster job scheduling, storage management, and network management
- Knowledge of the Linux kernel
- Excellent problem-solving and project management skills
- Flexibility to work in an evolving environment with changing requirements
Responsibilities
- Build internal performance and power profiling and analysis tools and platforms for AI workloads at cluster scale
- Build debugging tools for commonly encountered problems in GPU clusters
- Work with our users to build and calibrate performance and power models for next-generation hardware or systems
- Partner with architects to propose new hardware features or improve existing ones, based on real-world use cases
Preferred Qualifications
- Proven experience with continuous profiling and analysis tools or platforms at GPU cluster scale
- Solid experience in troubleshooting large AI jobs and in failure detection and recovery
- Skilled in deep learning application performance analysis and optimization
- Knowledge of GPU/CPU architecture and of application performance or power-efficiency analysis