
Principal AI Infrastructure SRE Engineer

Company: NVIDIA
Location: Santa Clara, CA, USA
Salary: $248,000 – $391,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Expert or higher

Requirements

  • Bachelor’s degree in Engineering, Computer Science, Mathematics, or related field, or equivalent experience.
  • 15+ years of proven experience in compute platform engineering with a focus on automation.
  • Experience designing, deploying, and operating infrastructure that supports AI and software development at scale, including Kubernetes and the integration of modern AI data infrastructure platforms into Kubernetes workloads.
  • Proven experience integrating existing application architectures, building new ones, and identifying opportunities for containerization to improve scalability, reliability, and efficiency.
  • Proficiency in programming languages such as Go and/or Python. Experience developing tools for data analysis and performance profiling, and developing with Terraform and configuration management tools.
  • Experience designing and running large environments consisting of bare-metal servers and virtualized environments, with a mix of tens of thousands of VMs and cloud infrastructure or AI infrastructure.
  • Deep understanding of other infrastructure components such as storage, DNS, AD, and security tools.

Responsibilities

  • Lead initiatives to transform on-premises IT infrastructure platform architecture and services for modern AI workloads and for AI semiconductor and software development.
  • Collaborate with partners to design architecture and to build and operate platforms that transform storage, compute, and middleware with modern security paradigms.
  • Build software and automation to run infrastructure at scale with minimal human intervention. Develop and maintain tools for collecting, analyzing, and visualizing data for reporting, alerting, and monitoring.
  • Collect and review system data for capacity planning, analyze capacity data and develop plans for enterprise-wide systems at the appropriate level, and coordinate with management personnel in implementing changes.
  • Collaborate with NVIDIA leadership, senior engineers, program managers, and product managers to develop compelling IT products and services that meet customer needs.

Preferred Qualifications

  • Solid understanding of microservices architecture, infrastructure as code (IaC) and configuration management tools.
  • Understanding of AIOps and how to leverage LLMs to automate various optimization initiatives.