Senior Site Reliability Engineer
Company | Spreedly |
---|---|
Location | United States |
Salary | Not provided |
Type | Full-Time |
Degrees | |
Experience Level | Senior |
Requirements
- Hands-on experience with Datadog, OpenTelemetry, Sentry, and Sumo Logic or similar monitoring and observability platforms, with a focus on actionable metrics and alerts.
- Proficiency in a modern programming language, with a proven ability to write clean, maintainable, and efficient code.
- Experience with AWS services, including EC2 (Ubuntu Linux), S3, and RDS.
- In-depth knowledge of relational and distributed databases (e.g., CockroachDB, PostgreSQL, Riak) with experience in performance optimization and query tuning.
- Experience applying design patterns to enhance reliability, scalability, and performance in application development.
- Excellent problem-solving skills with experience diagnosing complex system issues in production environments.
- Proven ability to work cross-functionally with product, application, infrastructure, and security engineering teams.
- Strong written and verbal communication skills, with the ability to explain complex technical concepts to non-technical stakeholders.
Responsibilities
- Ensure the reliability, availability, and performance of the production systems behind Spreedly’s globally distributed payments platform, which processes $4B monthly, through monitoring, automation, and continuous improvement.
- Collaborate with development teams to improve the reliability and performance of Ruby on Rails and Elixir applications.
- Implement and maintain robust observability solutions using Datadog and OpenTelemetry, enabling proactive identification, alerting, and resolution of issues.
- Lead incident response efforts by participating in a shared on-call rotation to maintain 24/7 system reliability, including root cause analysis, resolution, and implementing measures to prevent recurrence.
- Develop and maintain automation tools to reduce manual intervention, streamline operations, and enhance developer productivity.
- Monitor, analyze, and optimize the performance of relational databases, identifying and resolving bottlenecks to maintain data integrity and efficiency.
- Lead by example, infusing modern SRE best practices and fostering a culture of reliability and performance within the engineering organization.
- Provide technical guidance and mentorship to team members, fostering a culture of learning and collaboration.
Preferred Qualifications
- Ruby, Rails, and Elixir experience are preferred.
- Experience with Kafka is a plus.