Research Engineer - Tokens ML Infra

Strong software engineering skills with experience in building distributed systems
Expertise in Python and experience with distributed computing frameworks
Deep understanding of cloud computing platforms and distributed systems architecture
Experience with high-throughput, fault-tolerant system design
Strong background in performance optimization and system scaling
Excellent problem-solving skills and attention to detail
Strong communication skills and ability to work in a collaborative environment

Design and implement high-performance ML training infrastructure for large language model research
Develop and maintain core ML framework primitives in JAX, PyTorch, and similar frameworks
Create robust automated evaluation and benchmarking systems for model performance
Implement comprehensive monitoring and debugging tools for ML workflows
Design and optimize data loading pipelines that maximize training throughput
Build MLOps tooling to support reproducible research and experimentation
Collaborate with research teams to prototype and scale novel training architectures
Develop infrastructure for efficient hyperparameter sweeps and architecture search