Software Engineer – Evals Infrastructure – Preparedness

Company: OpenAI
Location: San Francisco, CA, USA
Salary: $325,000
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior, Expert or higher

Requirements

  • Bachelor’s degree in Computer Science, Information Technology, or a related field (or equivalent work experience)
  • At least 7 years of professional software engineering experience
  • Proven experience as a reliability engineer, or in a similar role, at a fast-paced, rapidly scaling company
  • Strong proficiency in cloud infrastructure
  • Proficiency in programming/scripting languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Knowledge of IaC tools such as Terraform or CloudFormation
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
  • Experience with microservices architecture and service mesh technologies
  • Knowledge of security best practices in cloud environments

Responsibilities

  • Scale our infrastructure to support a wide variety of evaluations, supporting systems, and automation
  • Collaborate with development teams to make our systems more reliable (owning Production Readiness Reviews)
  • Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment
  • Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability
  • Implement fault-tolerant and resilient design patterns to minimize service disruptions
  • Build and maintain automation tools to streamline repetitive tasks and improve system reliability
  • Partner with engineers and researchers at OpenAI to help bring frontier research capabilities to the world
  • Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability

Preferred Qualifications

  • Enjoy seeking out and addressing bottlenecks and areas for performance improvement in our systems
  • Utilize Infrastructure as Code (IaC) principles to automate infrastructure provisioning and configuration management
  • Experienced in collaborating with cross-functional teams to ensure that reliability and scalability are considered in the design and development of new features and services
  • Track record of accelerating engineering reliability by empowering fellow engineers with excellent tooling and systems
  • Help create a diverse, equitable, and inclusive culture that makes everyone feel welcome while enabling radical candor and the challenging of groupthink
  • Humble attitude, eagerness to help colleagues, and a desire to do whatever it takes to make the team succeed
  • Own problems end-to-end and are willing to pick up whatever knowledge you’re missing to get the job done