Senior Site Reliability Engineer
Ahead of the Curve.
Agero is powering the next generation of software-enabled driver safety services and technology, pushing the limits of big data to transform the entire driving experience. The majority of leading vehicle manufacturers and insurance providers use Agero’s roadside assistance, accident management, dispatch, consumer affairs and telematics innovations to strengthen their businesses and create stronger, lasting connections with their customers. Together, we’re making driving smarter and safer for everyone.
DESCRIPTION SUMMARY:
As a member of the Site Reliability Team, the Senior Site Reliability Engineer will conceive and execute a blueprint aimed at increasing service availability, forecasting monitoring needs and requirements, and automating resolution of future issues. This is achieved by focusing on proactive and holistic approaches to continuously improving the customer experience.
KEY OUTCOMES:
- Lead service decomposition efforts and implement service analysis KPIs in Splunk ITSI.
- Integrate product-specific DataDog APM monitoring with Splunk Enterprise logging systems.
- Integrate core AWS CloudWatch and Azure monitoring tools with Splunk centralized systems at Agero.
KEY COMPETENCIES:
- Decision Making
- Problem Solving
- Planning
- Focus on Results
- Continuous Improvement & Innovation
- Leading Others: Motivating/Delegating
- Teamwork & Collaboration
- Informing & Communicating
RESPONSIBILITIES:
- Work to automate detection and resolution of recurring issues in the production environment.
- Leverage Splunk suite for critical monitoring solutions and integrations with other tools such as PagerDuty, Status Page, and JIRA/Cherwell.
- Formulate and execute strategic monitoring plans for the enterprise and provide ongoing gap analysis.
- Assist with design and development of monitoring systems, procedures and dashboards.
- Analyze problem areas and coordinate corrective actions before services are impacted.
- Perform and automate trend analysis of events and incidents to establish potential for future issues.
- Track and own unresolved issues and escalate to appropriate groups when necessary.
- Communicate with product, software engineering, and Tier 1/2 support teams to align on SRE strategy and toolchains.
- Recommend stack design improvements to facilitate automated remediation of production events.
- Research, develop and introduce tools and methodologies to increase application uptime.
- Provide strategies for improving application platforms with a focus on reliability, stability, performance and total cost of ownership.
- Maintain an understanding of industry best practices and leading-edge technologies and adopt them as appropriate.
- Reduce inefficiencies and increase cost savings in operational workflows across all platforms.
KNOWLEDGE, SKILLS AND ABILITIES:
EDUCATION: Bachelor's degree or equivalent education/experience. Advanced degree in related field a plus.
EXPERIENCE:
- Experience with a scripting language such as Python or PowerShell.
- Experience with Splunk, AWS, and Infrastructure as Code (CloudFormation, Terraform, etc.).
- Experience with Agile, Scrum, and DevOps concepts, and with configuration management.
- Ability to build, use and configure metrics collection, reporting and alerting systems.
- Experience working as a Site Reliability Engineer or a similar role operating a highly scalable and distributed platform.
- In-depth system administration experience.
COMPLEXITY: Demonstrates initiative and self-motivation. Utilizes strong decision-making and time management skills and communicates well with other team members. Identifies problems and recommends/implements solutions. Demonstrates ability to write clear, concise technical documentation.
WORKING RELATIONSHIPS: Works in a team environment on cross-functional teams. Must be able to work with both operations engineers and software developers. Works with peers to understand and take ownership of enterprise application reliability. Negotiates effectively with business stakeholders on the pros and cons of a particular technology approach. Ability to influence others and build consensus is key.
ADDITIONAL REQUIREMENTS: Performs other duties as assigned.
THIS DESCRIPTION IS NOT INTENDED TO BE A COMPLETE STATEMENT OF JOB CONTENT, BUT RATHER TO ACT AS A GUIDE TO THE ESSENTIAL FUNCTIONS PERFORMED. MANAGEMENT RETAINS THE DISCRETION TO ADD TO OR CHANGE THE DUTIES OF THE POSITION AT ANY TIME.