Site Reliability Engineer – High Performance Computing / AI-ML

CompanyX
Location: Palo Alto, CA, USA; Seattle, WA, USA; Austin, TX, USA; San Jose, CA, USA; New York, NY, USA
Salary: $120,000 – $297,000
Type: Full-Time
Experience Level: Junior, Mid Level

Requirements

  • 2+ years of professional software development experience
  • Extensive experience with Kubernetes and container orchestration
  • Proficiency in one or more object-oriented programming languages (e.g., Python, Java, C++, Scala)
  • Proficiency in scripting languages (e.g., Python, Bash)
  • Strong experience in configuration management (e.g., Puppet, Ansible, Chef)
  • Familiarity with Ethernet networking at scale and distributed systems
  • Strong troubleshooting skills and experience with HPC environments
  • Experience managing large-scale systems, ideally supporting thousands of machines
  • Working understanding of the storage systems required to support such environments
  • Experience with various GPU / accelerator architectures and ability to optimize performance on such platforms
  • Ability to think outside the box and come up with innovative solutions to complicated problems
  • Strong commitment and willingness to work in a fast-paced environment
  • Excellent communication and interpersonal skills

Responsibilities

  • Managing and troubleshooting large-scale clusters to ensure the stability and efficiency of our platform (primarily Linux + Kubernetes)
  • Collaborating with cross-functional teams, including hardware engineers and software developers, to support and improve our infrastructure
  • Automating the provisioning and deployment of systems to enhance long-term health and scalability
  • Ensuring the robustness of our HPC environments and storage clusters
  • Writing and maintaining scripts and tools for automation and monitoring
  • Addressing system failures and performance issues, identifying root causes, and implementing preventive measures
  • Working closely with end-users to understand changing needs as our environment evolves