Senior AI Infrastructure Engineer

Company: Wex
Location: Boston, MA, USA; San Francisco, CA, USA; Chicago, IL, USA; Portland, ME, USA
Salary: $126,000 – $168,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • Bachelor’s degree in Computer Science, Software Engineering, or a related field, or demonstrable equivalent understanding, experience, and capability.
  • 4+ years of experience in software engineering or cloud infrastructure, with a strong focus on AI/ML infrastructure.
  • Demonstrable advanced programming skills in a strongly typed third-generation language (3GL) such as Java, Python, C/C++, or Go.
  • Strong understanding of cloud platforms (AWS and Azure), including services relevant to AI/ML (e.g., EC2, S3, EKS, Azure ML, AKS).
  • Hands-on experience with containerization (Docker) and container orchestration (Kubernetes) in production environments.
  • Extensive experience building and managing CI/CD pipelines for infrastructure and ML model deployment (e.g., Jenkins, GitLab CI/CD).
  • Strong understanding of networking concepts (VPC, subnets, routing, firewalls) and experience configuring network infrastructure in the cloud.
  • Experience with infrastructure monitoring and alerting tools (e.g., Prometheus, Grafana, CloudWatch, Azure Monitor).
  • Strong scripting skills (Python, Bash) for automation and configuration management.
  • Excellent problem-solving skills, with the ability to analyze complex systems and identify performance bottlenecks.
  • Strong communication and collaboration skills, with the ability to work effectively in a team environment.

Responsibilities

  • Collaborate with data scientists, ML engineers, and stakeholders to understand the requirements and challenges of AI/ML workloads.
  • Design, implement, and maintain scalable, secure cloud infrastructure on AWS and Azure for AI/ML workloads, using infrastructure-as-code (IaC) tools such as Terraform.
  • Manage containerization (Docker) and orchestration (Kubernetes) for efficient deployment and scaling of AI/ML applications.
  • Develop and optimize CI/CD pipelines for automating the build, test, and deployment of AI/ML models and infrastructure.
  • Implement robust monitoring and alerting systems to ensure the health, performance, and reliability of production AI infrastructure.
  • Proactively analyze system performance data to identify bottlenecks, optimize resource utilization, and improve overall efficiency.
  • Stay current with emerging cloud technologies, tools, and best practices in the AI/ML infrastructure space.
  • Mentor and guide junior team members, fostering a culture of continuous learning and knowledge sharing.
  • Contribute to the team’s technical roadmap and strategic initiatives.
  • Troubleshoot complex technical issues and provide timely solutions.
  • Participate in an on-call rotation to ensure 24/7 availability and support of critical AI infrastructure.
  • Advocate for your positions while fully supporting team decisions.

Preferred Qualifications

  • Experience with MLOps tools and practices.
  • Familiarity with infrastructure as code (IaC) tools like Terraform or CloudFormation.
  • Contributions to open-source projects related to AI/ML infrastructure.
  • Experience with big data technologies (e.g., Hadoop, Spark).