Service Reliability Analyst II – ITIL

Company: Riot Games
Location: Los Angeles, CA, USA
Salary: Not Provided
Type: Full-Time
Degrees: Bachelor’s
Experience Level: Junior, Mid Level

Requirements

  • 2-4 years of hands-on experience in IT service management, data analysis, or technical operations, with a focus on maintaining and optimizing IT infrastructure.
  • Strong proficiency in incident, problem, change, and release management, with the ability to design and implement process flows using industry-standard methodologies.
  • Solid understanding of software development life cycles (SDLC) and how various components interact within larger ecosystems, ensuring seamless operation and scalability.
  • Clear awareness of system and service ownership within a multi-team environment, including the effective use of APIs/SDKs and adherence to SLAs.
  • Deep enthusiasm for operations and technology, with a proactive approach to continuous improvement in system reliability and performance.
  • In-depth experience with ITIL-based ticketing systems such as ServiceNow, Jira, or similar platforms for tracking and managing IT service processes.
  • Advanced skills in Tableau, Datawrapper, and Excel for creating actionable insights from complex datasets.
  • Proficiency in JQL, SQL, and XQuery for querying and manipulating data across various platforms.
  • Expertise in setting up and managing monitoring frameworks using tools like Datadog and New Relic to ensure system health and performance.
  • Skill in event correlation to improve incident response, using tools such as Datadog, BigPanda, or PagerDuty.

Responsibilities

  • Lead and facilitate weekly technical discussions on service reliability with key product teams, ensuring alignment on operational goals and performance metrics.
  • Conduct thorough audits of incident data in collaboration with service owners to validate accuracy and ensure comprehensive reporting and analysis.
  • Collect, synthesize, and report on system health metrics for Riot’s diverse infrastructure, utilizing advanced data collection methods and monitoring tools.
  • Perform in-depth analysis of operational data trends to identify and address systemic issues and optimize service performance.
  • Participate in on-call rotations to provide critical support and ensure rapid response to incidents, minimizing downtime and service disruptions.
  • Assist in tracking and coordinating corrective actions for root cause analysis, ensuring thorough resolution of underlying issues and continuous improvement of operational processes.
  • Develop and maintain dashboards and reports that provide insights into key operational performance metrics, helping leaders make data-driven decisions.

Preferred Qualifications

  • 2+ years of specialized experience in Service Reliability Engineering (SRE) or equivalent roles such as Technical Release Manager, Process Owner, Live Operations Engineer, or Network Administrator.
  • Bachelor’s degree in Computer Science, IT Systems, Information Technology, or a closely related field, or equivalent professional experience.
  • Advanced proficiency in data analysis and insights, with the ability to derive actionable intelligence from large datasets.
  • Relevant certifications such as AWS Certified Solutions Architect, CompTIA Linux+, or CompTIA Network+, or equivalent credentials, are highly valued.
  • Demonstrated expertise in deploying and managing monitoring solutions such as Datadog and New Relic to ensure system health and performance within complex environments.