Staff Software Engineer - Machine Learning Infrastructure - Slack

5+ years of software engineering experience, including 3+ years in machine learning
Built large-scale, distributed, production ML/AI systems professionally
Worked on complex issues requiring in-depth knowledge of the company and existing architecture
Familiarity with modern methodologies for unit tests, code review, design documentation, debugging, and troubleshooting
Experience developing, monitoring, and deploying systems in cloud environments such as AWS, GCP, or Azure
Experience with ops tools and frameworks such as Terraform, Chef, and Kubernetes
Experience with ML model serving frameworks and toolkits such as Kubeflow, MLflow, AWS Bedrock, and SageMaker
Experience with functional or imperative programming languages such as PHP, Python, Ruby, Go, C, Scala, or Java
Experience with Grafana, Prometheus, Honeycomb, or other monitoring software

Managing deployments of machine learning models in our own Kubernetes-based deployment system and through AWS Bedrock and SageMaker, working with tools like Chef, HashiCorp Terraform, and KubeRay
Optimizing our models and infrastructure to reduce latency and handle spikes in traffic
Constantly evaluating and improving our infrastructure to maximize efficiency and minimize costs
Setting up our model training infrastructure to fine-tune embedding models while keeping our customers’ data secure
Working with our search team to generate embeddings at scale to power semantic search and enterprise search
Working with our ML Modeling and AI teams to support development of AI features and deployment at scale
Building and supporting an AI Platform
Supporting a 24/7 on-call rotation

You’re analytical and data-driven
Experience developing machine learning models in PyTorch, TensorFlow, XGBoost, Scikit-learn or similar
Experience building data pipelines with Airflow, Spark, or similar tools
Experience with vector-based retrieval using systems such as Vespa, Milvus, or Solr