Site Reliability Engineer
Company | Charles Schwab |
---|---|
Location | Lone Tree, CO, USA; Austin, TX, USA; Westlake, TX, USA; Ann Arbor, MI, USA; Omaha, NE, USA |
Salary | Not provided |
Type | Full-Time |
Degrees | |
Experience Level | Mid Level, Senior |
Requirements
- 4+ years of experience with large-scale enterprise system administration, application support or incident handling
- 4+ years of experience with RHEL Linux administration or Windows server administration
- 4+ years of experience, with a proven track record, supporting enterprise production environments while adhering to DevOps and SRE frameworks
- 4+ years of experience building application dashboards for proactive monitoring and setting up alerts
- 4+ years of experience with logging/application monitoring tools (AppDynamics, Splunk, Dynatrace, ThousandEyes)
- 2+ years of experience supporting applications on cloud platforms such as GCP and Pivotal Cloud Foundry (PCF)
- 3+ years of experience using Atlassian tools (Jira, Confluence, Bamboo)
Responsibilities
- Practice a Site Reliability Engineering mindset and solve problems through automation, instrumentation, and simplicity
- Partner with architects, development leads, business partners, and other SREs on the team to ensure implementations are architected and designed for resiliency
- Identify application reliability and availability improvements, and establish and build solutions that continue to improve the experience
- Perform production support and application deployments, and provide rapid response for critical trading applications
- Proactively monitor systems and review SLO/SLI metrics and runbooks
- Implement and collaborate on solutions that increase the monitoring and observability of systems at scale
- Work with development teams to provide recommendations on system health, upgrades, and toil reduction
- Advocate for Schwab’s Reliability Engineering principles, guidelines, and standards
- Foster a culture of learning through education and knowledge sharing around reliability practices, processes, and tools
- Participate in on-call escalations during market hours and off-hours
Preferred Qualifications
- Experience researching and building dashboards for Grafana and Prometheus
- Experience with Google Cloud Anthos and Kubernetes
- Strong understanding of and experience with Platform as a Service (PaaS) and Infrastructure as a Service (IaaS) offerings such as Pivotal Cloud Foundry (PCF)
- Experience with Continuous Integration/Continuous Delivery (CI/CD) pipelines
- Understanding of high-availability enterprise systems and of leveraging tools to automate proactive, and eventually predictive, availability solutions
- Receptive, approachable teammate, with the ability to positively interact with business partners, technology teams, offshore teams, and professional services
- Strong advocate with excellent written and verbal communication skills