Staff Software Engineer – Machine Learning Infrastructure – Slack
Company | Salesforce |
---|---|
Location | Dallas, TX, USA; Atlanta, GA, USA |
Salary | Not Provided |
Type | Full-Time |
Degrees | |
Experience Level | Senior |
Requirements
- 5+ years of software engineering experience, including 3+ years in machine learning
- Built large-scale, distributed, production ML/AI systems professionally
- Worked on complex issues requiring in-depth knowledge of the company and existing architecture
- Familiarity with modern methodologies for unit tests, code review, design documentation, debugging, and troubleshooting
- Experience developing, deploying, and monitoring systems in cloud environments such as AWS, GCP, or Azure
- Experience with ops tools and frameworks such as Terraform, Chef, and Kubernetes
- Experience with ML model serving frameworks and toolkits such as Kubeflow, MLflow, AWS Bedrock, and SageMaker
- Experience with functional or imperative programming languages such as PHP, Python, Ruby, Go, C, Scala, or Java
- Experience with Grafana, Prometheus, Honeycomb, or other monitoring software
Responsibilities
- Managing deployments of machine learning models in our own Kubernetes-based deployment system and through AWS Bedrock and SageMaker, working with tools like Chef, HashiCorp Terraform, and KubeRay
- Optimizing our models and infrastructure to reduce latency and handle spikes in traffic
- Constantly evaluating and improving our infrastructure to maximize efficiency and minimize costs
- Setting up our model training infrastructure to fine-tune embedding models while keeping our customers’ data secure
- Working with our search team to generate embeddings at scale to power semantic search and enterprise search
- Working with our ML Modeling and AI teams to support development of AI features and deployment at scale
- Building and supporting an AI Platform
- Participating in a 24/7 on-call rotation
Preferred Qualifications
- You’re analytical and data-driven
- Experience developing machine learning models in PyTorch, TensorFlow, XGBoost, Scikit-learn or similar
- Experience building data pipelines with Airflow, Spark, or similar tools
- Experience with vector-based retrieval using tools such as Vespa, Milvus, or Solr