Platform Engineer – MLOps

Company: Writer
Location: San Francisco, CA, USA
Salary: Not provided
Type: Full-Time
Degrees: Not specified
Experience Level: Senior, Expert or higher

Requirements

  • Model training
  • Hugging Face Transformers
  • PyTorch
  • vLLM
  • TensorRT
  • Infrastructure as code tools like Terraform
  • Scripting languages such as Python or Bash
  • Cloud platforms such as Google Cloud, AWS or Azure
  • Git and GitHub workflows
  • Tracing and monitoring
  • Familiar with high-performance, large-scale ML systems
  • Proactive in identifying problems, performance bottlenecks, and areas for improvement
  • Take pride in building and operating scalable, reliable, secure systems
  • Familiar with monitoring tools such as Prometheus, Grafana, or similar
  • Comfortable with ambiguity and rapid change

Responsibilities

  • Work closely with AI/ML engineers and researchers to design and deploy a CI/CD pipeline that ensures safe and reproducible experiments.
  • Set up and manage monitoring, logging, and alerting systems for extensive training runs and client-facing APIs.
  • Ensure training environments are consistently available and prepared across multiple clusters.
  • Develop and manage containerization and orchestration systems utilizing tools such as Docker and Kubernetes.
  • Operate and oversee large Kubernetes clusters with GPU workloads.
  • Improve the reliability, quality, and time-to-market of our suite of software solutions.
  • Measure and optimize system performance to advance our capabilities, anticipate customer needs, and drive continual improvement.
  • Provide primary operational support and engineering for multiple large-scale distributed software applications.

Preferred Qualifications

  • Familiar with monitoring tools such as Prometheus, Grafana, or similar
  • 5+ years building core infrastructure
  • Experience running inference clusters at scale
  • Experience operating orchestration systems such as Kubernetes at scale