Software Engineer – Machine Learning Infrastructure
Company | Character.AI |
---|---|
Location | Menlo Park, CA, USA; New York, NY, USA |
Salary | $150,000 – $350,000 |
Type | Full-Time |
Degrees | |
Experience Level | Mid Level, Senior |
Requirements
- 4+ years of experience supporting infrastructure in an ML environment
- Experience in developing tools used to diagnose ML infrastructure problems and failures
- Experience with cloud platforms (e.g., Compute Engine, Kubernetes, Cloud Storage)
- Experience working with GPUs
Responsibilities
- Provide infrastructure support for our ML research and product
- Build tooling to diagnose cluster issues and hardware failures
- Monitor deployments, manage experiments, and generally support our research
- Maximize GPU allocation and utilization for both serving and training
Preferred Qualifications
- Experience with large GPU clusters and high-performance computing/networking
- Experience supporting large language model training
- Experience with ML frameworks such as PyTorch, TensorFlow, or JAX
- Experience with GPU kernel development