GPU Cluster Resource Scheduling and Optimization Engineer

Company: Together AI
Location: San Francisco, CA, USA
Salary: $160,000 – $230,000
Type: Full-Time
Degrees:
Experience Level: Senior

Requirements

  • 5+ years of experience in resource scheduling, distributed systems, or large-scale machine learning infrastructure.
  • Proficiency in distributed computing frameworks (e.g., Kubernetes, Slurm, Ray).
  • Expertise in designing and implementing resource allocation algorithms and scheduling frameworks.
  • Hands-on experience with cloud platforms (e.g., AWS, GCP, Azure) and GPU orchestration.
  • Proficient in Python, C++, or Go for building high-performance systems.
  • Strong understanding of operations research techniques, such as linear programming, graph algorithms, or evolutionary strategies.
  • Analytical mindset with a focus on problem-solving and performance tuning.
  • Excellent collaboration and communication skills across teams.
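For candidates gauging fit with the "resource allocation algorithms" requirement above, a representative classic is max-min fair sharing: split capacity equally among unsatisfied demands, cap each at its demand, and redistribute the slack. A minimal, illustrative Python sketch (not part of the posting; all names hypothetical):

```python
def max_min_fair(capacity, demands):
    """Max-min fair allocation of a single resource.

    Repeatedly divide the remaining capacity equally among the
    still-unsatisfied demands, capping each user at its demand, until
    capacity runs out or every demand is met.
    """
    alloc = {user: 0.0 for user in demands}
    remaining = dict(demands)          # unsatisfied portion per user
    cap = float(capacity)
    while remaining and cap > 1e-9:
        share = cap / len(remaining)   # equal split of what's left
        satisfied = []
        for user, need in remaining.items():
            give = min(share, need)    # never exceed the demand
            alloc[user] += give
            cap -= give
            remaining[user] = need - give
            if remaining[user] <= 1e-9:
                satisfied.append(user)
        for user in satisfied:
            del remaining[user]
        if not satisfied:
            break                      # equal shares exhausted capacity
    return alloc
```

With 10 GPUs and demands of 2, 5, and 8, the small demand is fully met and the slack is split evenly between the larger two (2, 4, 4), which is the defining property of max-min fairness.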

Responsibilities

  • Develop and implement intelligent scheduling algorithms tailored to distributed AI workloads in multi-cluster, multi-tenant environments.
  • Ensure efficient allocation of GPUs, TPUs, and CPUs across diverse workloads, balancing resource utilization and job performance.
  • Design optimization techniques for dynamic resource allocation, addressing real-time variations in workload demand.
  • Implement load balancing, job preemption, and task placement strategies to maximize throughput and minimize latency.
  • Build systems that efficiently scale to thousands of nodes and petabytes of data.
  • Optimize training and inference pipelines to reduce runtime and cost while maintaining accuracy and reliability.
  • Build tools for real-time monitoring and diagnostics of resource utilization, job scheduling efficiency, and bottlenecks.
  • Leverage telemetry data and machine learning models for predictive analytics and proactive optimization.
  • Collaborate with researchers, data scientists, and platform engineers to understand workload requirements and align resource management solutions.
  • Stay updated with the latest trends in distributed systems, AI model training, and cloud-native technologies.
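To make the task-placement responsibility above concrete, one common baseline is priority-ordered best-fit placement: schedule higher-priority jobs first, and put each job on the node with the fewest free GPUs that still fits it, which reduces fragmentation. A minimal, hypothetical sketch (not Together AI's actual scheduler):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_gpus: int

@dataclass
class Job:
    name: str
    gpus: int        # GPUs requested
    priority: int = 0

def place_jobs(jobs, nodes):
    """Greedy best-fit task placement.

    Jobs are considered in descending priority order. Each job goes to
    the feasible node with the fewest free GPUs (best fit), leaving
    larger contiguous capacity for bigger jobs. Jobs that don't fit
    anywhere are left unplaced (i.e., stay queued).
    """
    placements = {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        candidates = [n for n in nodes if n.free_gpus >= job.gpus]
        if not candidates:
            continue                     # no node fits; job stays queued
        node = min(candidates, key=lambda n: n.free_gpus)
        node.free_gpus -= job.gpus
        placements[job.name] = node.name
    return placements
```

A real scheduler would add preemption, gang scheduling, and topology awareness on top of a baseline like this; the sketch only shows the core placement loop.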

Preferred Qualifications

  • Experience with AI/ML frameworks (e.g., TensorFlow, PyTorch, JAX).
  • Familiarity with AI-specific workloads such as distributed data parallel (DDP) training, sharded training, or reinforcement learning.
  • Knowledge of auto-scaling and cost-optimization strategies in cloud environments.
  • Contributions to open-source scheduling or orchestration projects.