Principal AI Infrastructure SRE Engineer
| Company | NVIDIA |
|---|---|
| Location | Santa Clara, CA, USA |
| Salary | $248,000 – $391,000 |
| Type | Full-Time |
| Degrees | Bachelor’s |
| Experience Level | Expert or higher |
Requirements
- Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
- 15+ years of proven experience in compute platform engineering with a focus on automation.
- Experience designing, deploying, and operating infrastructure that supports AI and software development at scale, including Kubernetes, and integrating modern AI data infrastructure platforms into Kubernetes workloads.
- Proven experience integrating existing application architectures, building new ones, and identifying opportunities for containerization to improve scalability, reliability, and efficiency.
- Proficiency in programming languages such as Go and/or Python. Experience developing tools for data analysis and performance profiling, and developing with Terraform and configuration management tools.
- Experience designing and running large environments consisting of bare-metal servers and virtualized environments, with a mix of tens of thousands of VMs and cloud infrastructure or AI infrastructure.
- Deep understanding of other infrastructure components such as storage, DNS, AD, and security tools.
Responsibilities
- Lead initiatives to transform on-prem IT infrastructure platform architecture and services for modern AI workloads and AI semiconductor and software development.
- Collaborate with partners to design architecture and to build and operate platforms that transform storage, compute, and middleware with modern security paradigms.
- Build software and automation to run infrastructure at scale with minimal human intervention. Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
- Collect and review system data for capacity planning, analyze capacity data and develop plans to size enterprise-wide systems appropriately, and coordinate with management personnel in implementing changes.
- Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs.
Preferred Qualifications
- Solid understanding of microservices architecture, infrastructure as code (IaC) and configuration management tools.
- Understanding of AI ops and how to leverage LLMs to automate various optimization initiatives.