Intermediate Site Reliability Engineer

Company: PointClickCare
Location: Mississauga, ON, Canada
Salary: $109,800 – $118,100
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Senior

Requirements

  • Bachelor’s Degree in Computer Science, Software Engineering, or related discipline.
  • Minimum 5 years’ experience as a Site Reliability Engineer (SRE).
  • Minimum 5 years’ relevant experience in software development, architecture, engineering, or DevOps.
  • Strong experience building and supporting cloud-based solutions; knowledge of and experience with Azure cloud infrastructure and services preferred.
  • Experience with virtualization and container solutions such as Docker and Kubernetes.
  • Familiarity with Databricks, Event Hub, Redis, Azure Service Bus, Azure Functions, and Tomcat.
  • Experience with Windows-based systems and Linux administration.
  • Experience with configuration management and deployment automation tools (e.g., Chef, Terraform, Puppet, Ansible, Jenkins, Spinnaker, ArgoCD, GitHub Actions).
  • Proficiency in programming languages such as Java, JavaScript and Python.
  • Working knowledge of database technologies (e.g., SQL Server, MySQL, PostgreSQL).
  • Experience with monitoring and logging solutions (e.g., Prometheus, Grafana, ELK stack, AppDynamics, Datadog).
  • Strong debugging and optimization skills, with the ability to automate routine tasks.
  • Systematic problem-solving approach with strong communication skills and a proactive mindset.
  • Knowledge of standard production practices, including change management and incident management (ITIL).
  • Experience building CI/CD pipelines and implementing blue/green and zero-downtime deployment strategies.
  • Troubleshooting experience with diverse hosting technologies, web servers, Java applications, operating systems, network components, and web browsers.

Responsibilities

  • Lead and implement SRE best practices to foster a strong SRE culture.
  • Coach, mentor and develop junior team members to grow into SREs.
  • Lead incident response calls to troubleshoot complex system and application-level issues.
  • Lead root cause analyses (RCAs) to capture lessons learned and implement innovative solutions that prevent incidents from recurring.
  • Design automated solutions to reduce manual tasks, enhance system reliability and reduce incident response time.
  • Implement and improve monitoring, alerting and logging using tools such as ELK/Kibana, Prometheus, Grafana, AppDynamics and Datadog.
  • Implement, monitor, and report on Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for application services.
  • Collaborate with business and product owners to establish key performance indicators (KPIs).
  • Participate in technical training events, game day scenarios, and chaos engineering.
  • Provide support for a wide range of applications with a focus on increasing automation, repeatability, and consistency as well as self-healing.
  • Proactively improve application and infrastructure resiliency under various error and performance conditions.
  • Collaborate with security engineers to develop plans and automation for proactive response to new risks and vulnerabilities.
  • Provide architectural guidance to software development teams to enhance resiliency, efficiency, performance, and cost-effectiveness.
  • Implement and improve CI/CD pipelines to facilitate seamless and reliable software releases.
  • Participate in an on-call rotation to respond to incidents and ensure 24/7 system availability.

Preferred Qualifications

  • Proficiency in Linux, including experience compiling kernels, tracing syscalls, and understanding TCP.
  • Knowledge of open-source software and contributions to the open-source community.
  • Familiarity with Rhapsody and various healthcare messaging standards, such as HL7 and FHIR.