Site Reliability Engineer

Company: Charles Schwab
Location: Lone Tree, CO, USA; Austin, TX, USA; Westlake, TX, USA; Ann Arbor, MI, USA; Omaha, NE, USA
Salary: Not provided
Type: Full-Time
Degrees: Not specified
Experience Level: Mid Level, Senior

Requirements

  • 4+ years of experience with large-scale enterprise system administration, application support, or incident handling
  • 4+ years of experience with RHEL Linux administration or Windows Server administration
  • 4+ years of experience, with a proven track record, supporting enterprise production environments while adhering to DevOps and SRE frameworks
  • 4+ years of experience building application dashboards for proactive monitoring, setting up alerts, etc.
  • 4+ years of experience with logging and application monitoring tools (AppDynamics, Splunk, Dynatrace, ThousandEyes)
  • 2+ years of experience supporting applications on cloud platforms such as GCP and Pivotal Cloud Foundry (PCF)
  • 3+ years of experience using Atlassian tools (Jira, Confluence, Bamboo)

Responsibilities

  • Practice Site Reliability Engineering mindset and solve problems through automation, instrumentation, and simplicity
  • Partner with architects, development leads, business partners, and other SREs on the team to ensure implementations are architected and designed for resiliency
  • Identify application reliability and availability improvements, and establish and build solutions that continue to drive an improved experience
  • Perform production support and application deployments, and provide rapid response for critical trading applications
  • Proactively perform system monitoring, and review SLO/SLI metrics and runbooks
  • Implement and collaborate on solutions that increase the monitoring and observability of systems at scale
  • Work with development teams to provide recommendations about system health upgrades and toil reduction
  • Advocate for Schwab’s Reliability Engineering principles, guidelines, and standards
  • Foster a culture of learning through education and knowledge sharing around reliability practices, processes, and tools
  • Participate in on-call escalations during market hours and off-hours

Preferred Qualifications

  • Experience researching and building dashboards for Grafana and Prometheus
  • Experience with Google Cloud Anthos and Kubernetes
  • Strong understanding of and experience with Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) offerings such as Pivotal Cloud Foundry (PCF)
  • Experience with Continuous Integration/Continuous Delivery pipelines (CI/CD)
  • Understanding of high-availability enterprise systems and of leveraging tools to automate proactive, and eventually predictive, availability solutions
  • Receptive, approachable teammate with the ability to interact positively with business partners, technology teams, offshore teams, and professional services
  • Strong advocate with excellent written and verbal communication skills