Machine Learning Engineer – Staff – Model Factory
Company | d-Matrix
---|---
Location | Santa Clara, CA, USA
Salary | $155,000 – $250,000
Type | Full-Time
Degrees | Bachelor’s, Master’s
Experience Level | Senior, Expert or higher
Requirements
- BS in Computer Science with 7+ years or MS in Computer Science with 4+ years
- Strong programming skills in Python and experience with ML frameworks like PyTorch, TensorFlow, or JAX
- Hands-on experience with model optimization, quantization, and inference acceleration
- Deep understanding of Transformer architectures, attention mechanisms, and distributed inference (Tensor Parallel, Pipeline Parallel, Sequence Parallel)
- Knowledge of reduced-precision formats (INT8, BF16, FP16) and memory-efficient inference techniques
- Solid grasp of software engineering best practices, including CI/CD, containerization and orchestration (Docker, Kubernetes), and MLOps
- Strong problem-solving skills and ability to work in a fast-paced, iterative development environment
Responsibilities
- Design, build, and optimize machine learning deployment pipelines for large-scale models
- Implement and enhance model inference frameworks
- Develop automated workflows for model development, experimentation, and deployment
- Collaborate with research, architecture, and engineering teams to improve model performance and efficiency
- Work with distributed computing frameworks (e.g., PyTorch/XLA, JAX, TensorFlow, Ray) to optimize model parallelism and deployment
- Implement scalable KV caching and memory-efficient inference techniques for transformer-based models
- Monitor and optimize infrastructure performance across the custom hardware hierarchy (cards, servers, and racks) powered by d-Matrix custom AI chips
- Ensure best practices in ML model versioning, evaluation, and monitoring
Preferred Qualifications
- Experience working with cloud-based ML pipelines (AWS, GCP, or Azure)
- Experience with LLM fine-tuning, LoRA, PEFT, and KV cache optimizations
- Contributions to open-source ML projects or research publications
- Experience with low-level optimizations using CUDA, Triton, or XLA