Senior Systems Engineer - Site Reliability

Deep knowledge of Linux-based operating systems (Red Hat Enterprise Linux, CentOS, Rocky, Alma, SUSE, Ubuntu, Debian) including configuration and management, security, performance tuning/monitoring, and troubleshooting.
Proficiency in scripting and automation with Ansible, Bash, or other tools and languages such as Python, Terraform, or PowerShell.
Thorough understanding of computer networking (DHCP, DNS, HTTP, TCP/UDP, IPv4/IPv6, OSI model) and load balancing.
Understanding of client-server architecture as it relates to web/app servers, databases, load balancers, and other infrastructure platforms.

Monitor system performance metrics and logs to preemptively identify and resolve issues.
Engage in real-time troubleshooting of connections between databases and web servers, ensuring high availability and minimal downtime.
Develop, maintain, and evaluate automation scripts and playbooks for common infrastructure patterns and standards, using Terraform and custom scripts to ensure consistent and repeatable environments.
Implement and maintain monitoring solutions that provide insights into system health and performance.
Leverage tools like Dynatrace and Splunk to create actionable alerts and dashboards.
Regularly assess and tune the performance of Linux and other operating systems, ensure secure configurations, and manage vulnerabilities to protect against threats.
Work closely with application development teams and other stakeholders to support and deploy new technologies that align with business goals.
Champion best practices and contribute to technology strategy discussions.
Produce high-quality documentation detailing the configuration, operation, and troubleshooting of supported systems and software.
Share knowledge with team members on best practices.
Participate in an on-call rotation to ensure that our critical systems function reliably around the clock.

Experience working in Amazon Web Services (AWS), Azure, or Google Cloud Platform (GCP).
Prior experience working in a customer-facing and/or helpdesk role.
Working knowledge of any of the following: VMware, Redis and/or other caching technologies, Splunk, Azure DevOps, Active Directory, PowerBI, Dynatrace, ServiceNow, or PagerDuty.
Knowledge of agile development and delivery processes.
Experience working with a continuous integration/continuous deployment (CI/CD) platform like CircleCI, Jenkins, Octopus Deploy, or GitHub Actions. Alternatively, experience with a job scheduling/orchestration platform like Rundeck or Tidal.
Working knowledge of other operating systems and platforms like Windows Server.
Background in deploying and managing container and container orchestration platforms such as Kubernetes, Docker, or ECS.