Site Reliability Engineer – High Performance Computing / AI-ML
Company | X |
---|---|
Location | Palo Alto, CA, USA; Seattle, WA, USA; Austin, TX, USA; San Jose, CA, USA; New York, NY, USA |
Salary | $120,000 – $297,000 |
Type | Full-Time |
Degrees | |
Experience Level | Junior, Mid Level |
Requirements
- 2+ years of professional software development experience
- Extensive experience with Kubernetes and container orchestration
- Proficiency in one or more object-oriented programming languages (e.g., Python, Java, C++, Scala)
- Proficiency in scripting languages (Python, Bash, etc.)
- Strong experience in configuration management (e.g., Puppet, Ansible, Chef)
- Familiarity with Ethernet networking at scale and distributed systems
- Strong troubleshooting skills and experience with HPC environments
- Experience managing large-scale systems, ideally supporting thousands of machines
- Working understanding of the storage systems required to support such environments
- Experience with various GPU / accelerator architectures and ability to optimize performance on such platforms
- Ability to think outside the box and come up with innovative solutions to complicated problems
- Strong commitment and willingness to work in a fast-paced environment
- Excellent communication and interpersonal skills
Responsibilities
- Managing and troubleshooting large-scale clusters to ensure the stability and efficiency of our platform (primarily Linux + Kubernetes)
- Collaborating with cross-functional teams, including hardware engineers and software developers, to support and improve our infrastructure
- Automating the provisioning and deployment of systems to enhance long-term health and scalability
- Ensuring the robustness of our HPC environments and storage clusters
- Writing and maintaining scripts and tools for automation and monitoring
- Addressing system failures and performance issues, identifying root causes, and implementing preventive measures
- Working closely with end-users to understand changing needs as our environment evolves
Preferred Qualifications
No preferred qualifications provided.