Senior Director – Operations and Reliability Engineering

Company: Boston Consulting Group
Location: Boston, MA, USA; Stratford, London, UK
Salary: Not provided
Type: Full-Time
Experience Level: Senior, Expert or higher

Requirements

  • 15+ years of experience in IT operations, SRE, DevOps, or platform engineering.
  • 5+ years in a senior leadership role, managing large-scale IT environments.
  • Deep technical expertise in cloud computing (AWS, Azure, GCP), on-prem infrastructure, and hybrid environments.
  • Proven track record in end-to-end automation, Infrastructure as Code (IaC), and large-scale observability.
  • Experience in AI-driven IT operations, predictive analytics, and automated remediation.
  • Strong understanding of zero-trust security, regulatory compliance, and risk management.
  • Excellent leadership, communication, and stakeholder management skills.

Responsibilities

  • Define and execute a modern Reliability Engineering strategy, integrating SRE, DevOps, and automation-first operational models.
  • Drive end-to-end automation to eliminate toil, improve efficiency, and enhance operational resilience.
  • Lead the transition from traditional IT operations to a proactive, AI-driven, self-healing infrastructure.
  • Establish a global observability, telemetry, and predictive analytics framework for real-time insights.
  • Align operational strategies with business goals, ensuring IT supports digital transformation initiatives across BCG Core, BCG X, and CT.
  • Oversee global IT infrastructure, cloud platforms, and hybrid hosting environments across all BCG business units.
  • Manage network reliability, compute platforms, and cloud-native services across AWS, Azure, and GCP.
  • Scale Infrastructure as Code (IaC), automated provisioning, and cloud workload optimization.
  • Drive edge computing, containerized workloads, and high-performance computing strategies.
  • Implement AI-driven monitoring, self-healing automation, and full-stack observability.
  • Mandate and assure the adoption of IT Service Management (ITSM) processes across all teams, ensuring standardized, efficient, and effective service delivery.
  • Establish SRE-based operational metrics, including SLOs, SLIs, and error budgets.
  • Oversee incident response, problem resolution, and root cause analysis with AI-driven remediation.
  • Ensure high availability, performance, and security compliance for all enterprise services.
  • Develop a follow-the-sun operational support model, ensuring 24×7 resilience and uptime across all of BCG.
  • Optimize incident, change, and capacity management, ensuring alignment with ITIL best practices and automated workflows.
  • Lead Service Asset and Configuration Management (SACM), ensuring accurate, real-time management of software and IT assets within the CMDB.
  • Drive continuous enhancements to the CMDB, improving visibility, compliance, and lifecycle management of IT assets.
  • Embed security and compliance into operational workflows with automated security controls.
  • Ensure adherence to ISO 27001, NIST, SOC 2, GDPR, and cloud security best practices.
  • Collaborate with cybersecurity teams to integrate zero-trust security models.
  • Drive resiliency planning, disaster recovery, and business continuity initiatives.
  • Optimize IT operational budgets with a cost-effective, cloud-native strategy.
  • Negotiate vendor contracts, ensuring alignment with business needs and service reliability.
  • Drive cost efficiency in cloud spending, SaaS platforms, and infrastructure investments.
  • Build and mentor a high-performing Reliability Engineering team, fostering a culture of automation and innovation.
  • Lead a team of SREs, DevOps engineers, and platform reliability experts across global squads.
  • Promote a collaborative, data-driven, and proactive mindset, ensuring agility and operational resilience.
  • Establish workforce development programs for AI-driven operations, automation, and modern reliability practices.

Preferred Qualifications

  • Certifications: ITIL, AWS/Azure/GCP Solutions Architect, SRE Foundation, CISSP, or equivalent.
  • Experience with Kubernetes, Terraform, Ansible, and AI-powered operations tools.
  • Strong problem-solving abilities, with a data-driven approach to operational excellence.