Platform Engineer – MLOps
| Company | Writer |
| --- | --- |
| Location | San Francisco, CA, USA |
| Salary | Not Provided |
| Type | Full-Time |
| Degrees | |
| Experience Level | Senior, Expert or higher |
Requirements
- Model training
- Hugging Face Transformers
- PyTorch
- vLLM (see the inference sketch after this list)
- TensorRT
- Infrastructure-as-code tools such as Terraform
- Scripting languages such as Python or Bash
- Cloud platforms such as Google Cloud, AWS or Azure
- Git and GitHub workflows
- Tracing and monitoring
- Familiar with high-performance, large-scale ML systems
- Proactive in identifying problems, performance bottlenecks, and areas for improvement
- Take pride in building and operating scalable, reliable, secure systems
- Comfortable with ambiguity and rapid change
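To ground the serving-related items above, here is a minimal sketch of offline batch inference with vLLM. It assumes a GPU host with vLLM installed; the model name, prompt, and sampling settings are placeholders for illustration, not anything specific to Writer's stack.

```python
# Minimal vLLM offline inference sketch (illustrative only).
# Assumes `pip install vllm` and a GPU host; the model, prompt, and
# sampling settings below are placeholders.
from vllm import LLM, SamplingParams

def main() -> None:
    # Load a small open model; tensor_parallel_size would be raised on
    # multi-GPU nodes to shard the weights.
    llm = LLM(model="facebook/opt-125m", tensor_parallel_size=1)

    sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)
    prompts = ["Summarize the benefits of infrastructure as code in one sentence."]

    # generate() batches the prompts and runs them through the engine.
    for output in llm.generate(prompts, sampling):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```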
Responsibilities
- Work closely with AI/ML engineers and researchers to design and deploy a CI/CD pipeline that ensures safe and reproducible experiments.
- Set up and manage monitoring, logging, and alerting systems for large-scale training runs and client-facing APIs (see the metrics sketch after this list).
- Ensure training environments are consistently available and prepared across multiple clusters.
- Develop and manage containerization and orchestration systems using tools such as Docker and Kubernetes.
- Operate and oversee large Kubernetes clusters with GPU workloads.
- Improve reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
- Provide primary operational support and engineering for multiple large-scale distributed software applications.
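As a rough illustration of the monitoring and alerting responsibility above, the sketch below exports a couple of training-run metrics with the `prometheus_client` Python library for Prometheus to scrape (and for Grafana dashboards or alert rules to build on). The metric names, labels, and port are assumptions made for the example.

```python
# Illustrative sketch: expose training-run metrics for Prometheus to scrape.
# Assumes `pip install prometheus-client`; metric names, labels, and the
# port are invented for this example.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

STEPS = Counter("training_steps_total", "Optimizer steps completed", ["run_id"])
LOSS = Gauge("training_loss", "Most recent training loss", ["run_id"])

def train_loop(run_id: str) -> None:
    # Stand-in for a real training loop; Prometheus scrapes the /metrics
    # endpoint started below, and alert rules fire on the resulting series.
    loss = 2.5
    while True:
        loss = max(0.1, loss - random.uniform(0.0, 0.01))
        STEPS.labels(run_id=run_id).inc()
        LOSS.labels(run_id=run_id).set(loss)
        time.sleep(1)

if __name__ == "__main__":
    start_http_server(8000)  # serves metrics at http://localhost:8000/metrics
    train_loop(run_id="example-run")
```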
Preferred Qualifications
- Familiar with monitoring tools such as Prometheus, Grafana, or similar
- 5+ years building core infrastructure
- Experience running inference clusters at scale
- Experience operating orchestration systems such as Kubernetes at scale (see the cluster sketch below)
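As a small illustration of working with Kubernetes clusters that carry GPU workloads, the sketch below uses the official `kubernetes` Python client to tally allocatable GPUs per node. It assumes a reachable kubeconfig and NVIDIA's device plugin exposing the `nvidia.com/gpu` resource; it is not a description of Writer's clusters.

```python
# Illustrative sketch: summarize GPU capacity across a cluster with the
# official `kubernetes` Python client. Assumes `pip install kubernetes`,
# a reachable kubeconfig, and nodes advertising the `nvidia.com/gpu`
# resource via NVIDIA's device plugin.
from kubernetes import client, config

def gpu_capacity_report() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod
    core = client.CoreV1Api()

    total = 0
    for node in core.list_node().items:
        # allocatable is a map of resource name -> quantity string
        gpus = int(node.status.allocatable.get("nvidia.com/gpu", "0"))
        total += gpus
        print(f"{node.metadata.name}: {gpus} allocatable GPUs")

    print(f"cluster total: {total} GPUs")

if __name__ == "__main__":
    gpu_capacity_report()
```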