Senior AI Infrastructure Engineer
Company | Wex |
---|---|
Location | Boston, MA; San Francisco, CA; Chicago, IL; Portland, ME (USA) |
Salary | $126,000 – $168,000 |
Type | Full-Time |
Degrees | Bachelor’s |
Experience Level | Senior |
Requirements
- Bachelor’s degree in Computer Science, Software Engineering, or a related field, or equivalent demonstrable depth of understanding, experience, and capability.
- 4+ years of experience in software engineering or cloud infrastructure, with a strong focus on AI/ML infrastructure.
- Demonstrable advanced programming skills in a strongly typed, third-generation language such as Java, Python, C/C++, or Go.
- Strong understanding of cloud platforms (AWS and Azure), including services relevant to AI/ML (e.g., EC2, S3, EKS, Azure ML, AKS).
- Hands-on experience with containerization (Docker) and container orchestration (Kubernetes) in production environments.
- Extensive experience building and managing CI/CD pipelines for infrastructure and ML model deployment (e.g., Jenkins, GitLab CI/CD).
- Strong understanding of networking concepts (VPC, subnets, routing, firewalls) and experience configuring network infrastructure in the cloud.
- Experience with infrastructure monitoring and alerting tools (e.g., Prometheus, Grafana, CloudWatch, Azure Monitor).
- Strong scripting skills (Python, Bash) for automation and configuration management; a minimal illustrative sketch follows this list.
- Excellent problem-solving skills, with the ability to analyze complex systems and identify performance bottlenecks.
- Strong communication and collaboration skills, with the ability to work effectively in a team environment.
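To make the scripting expectation concrete, here is a minimal Python sketch of the kind of automation this role involves, using boto3 against a hypothetical S3 bucket of model artifacts (the bucket, prefix, and retention window are assumptions for illustration, not actual Wex infrastructure):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Hypothetical names for illustration only.
BUCKET = "example-ml-artifacts"
PREFIX = "models/"


def list_stale_model_artifacts(days: int = 30) -> list[str]:
    """Return S3 keys under PREFIX older than `days` (candidates for cleanup)."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    s3 = boto3.client("s3")
    stale = []
    # Paginate so the script behaves correctly for buckets with more than 1000 objects.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] < cutoff:
                stale.append(obj["Key"])
    return stale


if __name__ == "__main__":
    for key in list_stale_model_artifacts():
        print(key)
```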
Responsibilities
- Collaborate with data scientists, ML engineers, and stakeholders to understand the requirements and challenges of AI/ML workloads.
- Design, implement, and maintain scalable and secure cloud infrastructure on AWS and Azure to support AI/ML workloads, using infrastructure-as-code (IaC) tools such as Terraform.
- Manage containerization (Docker) and orchestration (Kubernetes) for efficient deployment and scaling of AI/ML applications.
- Develop and optimize CI/CD pipelines for automating the build, test, and deployment of AI/ML models and infrastructure.
- Implement robust monitoring and alerting systems to ensure the health, performance, and reliability of production AI infrastructure; see the instrumentation sketch after this list.
- Proactively analyze system performance data to identify bottlenecks, optimize resource utilization, and improve overall efficiency.
- Stay current with emerging cloud technologies, tools, and best practices in the AI/ML infrastructure space.
- Mentor and guide junior team members, fostering a culture of continuous learning and knowledge sharing.
- Contribute to the team’s technical roadmap and strategic initiatives.
- Troubleshoot complex technical issues and provide timely solutions.
- Participate in on-call rotation to ensure 24/7 availability and support of critical AI infrastructure.
- Advocate for your positions while fully supporting team decisions.
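As a rough illustration of the monitoring responsibility above, the following Python sketch instruments a hypothetical inference service with the prometheus_client library so that Prometheus can scrape request counts and latency (the metric names and service stub are assumptions, not an existing Wex service):

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names for illustration only.
REQUESTS = Counter("inference_requests_total", "Total inference requests", ["model", "status"])
LATENCY = Histogram("inference_latency_seconds", "Inference latency in seconds", ["model"])


def handle_request(model: str) -> None:
    # Record latency around the (stubbed) inference call, then count the outcome.
    with LATENCY.labels(model=model).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real model inference
    REQUESTS.labels(model=model, status="ok").inc()


if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        handle_request("example-model")
```

Grafana dashboards and alert rules would then be layered on top of the scraped metrics.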
Preferred Qualifications
- Experience with MLOps tools and practices.
- Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
- Contributions to open-source projects related to AI/ML infrastructure.
- Experience with big data technologies (e.g., Hadoop, Spark); a brief PySpark sketch follows below.
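For the big-data item above, a minimal PySpark sketch of a batch feature-aggregation job (paths, schema, and column names are hypothetical, chosen only to show the shape of such a job):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical paths and columns for illustration only.
spark = SparkSession.builder.appName("daily-event-counts-sketch").getOrCreate()

events = spark.read.parquet("s3a://example-bucket/events/")  # raw event logs

# Aggregate events per user per day as a simple feature table.
daily = (
    events
    .withColumn("event_date", F.to_date("event_timestamp"))
    .groupBy("user_id", "event_date")
    .agg(F.count("*").alias("event_count"))
)

daily.write.mode("overwrite").parquet("s3a://example-bucket/features/daily_counts/")
spark.stop()
```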