Senior Machine Learning Operations Developer – Inference Optimization

Company: Cerence
Location: Montreal, QC, Canada
Salary: Not Provided
Type: Full-Time
Degrees: Master's, PhD
Experience Level: Senior

Requirements

  • 6+ years of experience in software engineering, with a focus on AI/ML.
  • Deep expertise in AI model optimization techniques, including quantization, pruning, knowledge distillation, and hardware-aware model design.
  • Proficiency in programming languages such as Python, C++, or Rust.
  • Experience with AI/ML frameworks such as TensorFlow, PyTorch, and ONNX.
  • Hands-on experience with GPU/TPU acceleration and deployment in cloud and edge environments.
  • Strong DevOps mindset with experience in Kubernetes, containers, deployments, dashboards, high availability, autoscaling, metrics, and logs.
  • Strong problem-solving skills and the ability to make data-driven decisions.
  • Excellent communication skills and the ability to articulate complex technical concepts to a diverse audience.

Responsibilities

  • Design, develop, and implement strategies to optimize AI/ML inference pipelines for performance, scalability, and cost efficiency.
  • Collaborate closely with other Principal and Senior Engineers on the team, fostering a culture of knowledge-sharing and joint problem-solving.
  • Work with cross-functional teams, including MLOps, data science, and software engineering, to integrate optimized inference solutions into production environments.
  • Drive innovation in hardware acceleration, quantization, model compression, and distributed inference techniques.
  • Stay up to date with LLM hosting frameworks and their configuration at both the machine and cluster level (e.g., vLLM, TensorRT, Kubeflow).
  • Optimize systems using techniques such as batching, caching, and speculative decoding.
  • Conduct performance tuning, benchmarking, and profiling for inference systems, with expertise in memory management, threading, concurrency, and GPU optimization.
  • Manage model repositories, artifact delivery, and related infrastructure.
  • Develop and maintain logging mechanisms for diagnostics and research purposes.

Preferred Qualifications

  • Experience with Kubernetes, Docker, and CI/CD pipelines for AI/ML workloads.
  • Familiarity with MLOps practices and tools, including model versioning and monitoring.
  • Familiarity with performance tuning of inference engines like vLLM and techniques such as LoRA adapters.
  • Understanding of LLM architecture and optimization.
  • Contributions to open-source AI/ML projects.
  • Familiarity with automotive or transportation industry applications.
  • Master’s or Ph.D. in Computer Science, Machine Learning, or a related field.