Senior Director - Operations and Reliability Engineering

Company	Boston Consulting Group
Location	Boston, MA, USA; Stratford, London, UK
Salary	Not provided
Type	Full-Time
Experience Level	Senior, Expert or higher

Requirements

15+ years of experience in IT operations, SRE, DevOps, or platform engineering.
5+ years in a senior leadership role, managing large-scale IT environments.
Deep technical expertise in cloud computing (AWS, Azure, GCP), on-prem infrastructure, and hybrid environments.
Proven track record in end-to-end automation, Infrastructure as Code (IaC), and large-scale observability.
Experience in AI-driven IT operations, predictive analytics, and automated remediation.
Strong understanding of zero-trust security, regulatory compliance, and risk management.
Excellent leadership, communication, and stakeholder management skills.

Responsibilities

Define and execute a modern Reliability Engineering strategy, integrating SRE, DevOps, and automation-first operational models.
Drive end-to-end automation to eliminate toil, improve efficiency, and enhance operational resilience.
Lead the transition from traditional IT operations to a proactive, AI-driven, self-healing infrastructure.
Establish a global observability, telemetry, and predictive analytics framework for real-time insights.
Align operational strategies with business goals, ensuring IT supports digital transformation initiatives across BCG Core, BCG X, and CT.
Oversee global IT infrastructure, cloud platforms, and hybrid hosting environments across all BCG business units.
Manage network reliability, compute platforms, and cloud-native services across AWS, Azure, and GCP.
Scale Infrastructure as Code (IaC), automated provisioning, and cloud workload optimization.
Drive edge computing, containerized workloads, and high-performance computing strategies.
Implement AI-driven monitoring, self-healing automation, and full-stack observability.
Mandate and assure the adoption of IT Service Management (ITSM) processes across all teams, ensuring standardized, efficient, and effective service delivery.
Establish SRE-based operational metrics, including SLOs, SLIs, and error budgets.
Oversee incident response, problem resolution, and root cause analysis with AI-driven remediation.
Ensure high availability, performance, and security compliance for all enterprise services.
Develop a follow-the-sun operational support model, ensuring 24×7 resilience and uptime across all of BCG.
Optimize incident, change, and capacity management, ensuring alignment with ITIL best practices and automated workflows.
Lead Service Asset and Configuration Management (SACM), ensuring accurate, real-time tracking of software and IT assets within the CMDB.
Drive continuous enhancements to the CMDB, improving visibility, compliance, and lifecycle management of IT assets.
Embed security and compliance into operational workflows with automated security controls.
Ensure adherence to ISO 27001, NIST, SOC 2, GDPR, and cloud security best practices.
Collaborate with cybersecurity teams to integrate zero-trust security models.
Drive resiliency planning, disaster recovery, and business continuity initiatives.
Optimize IT operational budgets with a cost-effective, cloud-native strategy.
Negotiate vendor contracts, ensuring alignment with business needs and service reliability.
Drive cost efficiency in cloud spending, SaaS platforms, and infrastructure investments.
Build and mentor a high-performing Reliability Engineering team, fostering a culture of automation and innovation.
Lead a team of SREs, DevOps engineers, and platform reliability experts across global squads.
Promote a collaborative, data-driven, and proactive mindset, ensuring agility and operational resilience.
Establish workforce development programs for AI-driven operations, automation, and modern reliability practices.

Preferred Qualifications

Certifications: ITIL, AWS/Azure/GCP Solutions Architect, SRE Foundation, CISSP, or equivalent.
Experience with Kubernetes, Terraform, Ansible, and AI-powered operations tools.
Strong problem-solving abilities, with a data-driven approach to operational excellence.