Software Engineer – Model Performance
| Company | Baseten |
| --- | --- |
| Location | San Francisco, CA, USA; New York, NY, USA |
| Salary | Not Provided |
| Type | Full-Time |
| Degrees | Bachelor’s, Master’s, PhD |
| Experience Level | Junior, Mid Level |
Requirements
- Bachelor’s, Master’s, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.
- Experience with one or more general-purpose programming languages, such as Python or C++.
- Familiarity with LLM optimization techniques (e.g., quantization, speculative decoding, continuous batching).
- Strong familiarity with ML libraries, especially PyTorch, TensorRT, or TensorRT-LLM.
- Demonstrated interest and experience in LLMs.
- Deep understanding of GPU architecture.
Responsibilities
- Implement, refine, and productionize cutting-edge techniques (quantization, speculative decoding, KV cache reuse, chunked prefill, and LoRA) for ML model inference and infrastructure.
- Dive deep into the underlying codebases of TensorRT, PyTorch, TensorRT-LLM, vLLM, SGLang, CUDA, and other libraries to debug ML performance issues.
- Apply and scale optimization techniques across a wide range of ML models, particularly large language models.
- Collaborate with a diverse team to design and implement innovative solutions.
- Own projects from idea to production.
Preferred Qualifications
- Proficiency in optimizing the performance of software systems, particularly large language models (LLMs).
- Experience with CUDA or similar technologies.
- Deep understanding of software engineering principles and a proven track record of developing and deploying AI/ML inference solutions.
- Experience with Docker and Kubernetes.